PangeaInstruct
收藏魔搭社区2026-01-06 更新2024-10-26 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/PangeaInstruct
下载链接
链接失效反馈官方服务:
资源简介:
# PangeaInstruct
[Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages](https://neulab.github.io/Pangea/)
🇪🇹 🇸🇦 🇧🇬 🇧🇩 🇨🇿 🇩🇪 🇬🇷 🇬🇧 🇺🇸 🇪🇸 🇮🇷 🇫🇷 🇮🇪 🇮🇳 🇮🇩 🇳🇬 🇮🇹 🇮🇱 🇯🇵 🇮🇩 🇰🇷 🇳🇱 🇲🇳 🇲🇾 🇳🇴 🇵🇱 🇵🇹 🇧🇷 🇷🇴 🇷🇺 🇱🇰 🇮🇩 🇰🇪 🇹🇿 🇱🇰 🇮🇳 🇮🇳 🇹🇭 🇹🇷 🇺🇦 🇵🇰 🇮🇳 🇻🇳 🇨🇳 🇹🇼
[🏠 Homepage](https://neulab.github.io/Pangea/) | [🤖 Pangea-7B](https://huggingface.co/neulab/Pangea-7B) | [📊 PangeaIns](https://huggingface.co/datasets/neulab/PangeaInstruct) | [🧪 PangeaBench](https://huggingface.co/collections/neulab/pangea-6713c3b0d78a453906eb2ed8) | [💻 Github](https://github.com/neulab/Pangea/tree/main) | [📄 Arxiv](https://arxiv.org/abs/2410.16153) | [📕 PDF](https://arxiv.org/pdf/2410.16153) | [🖥️ Demo](https://huggingface.co/spaces/neulab/Pangea)
<img src="https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/ZjVTKnIsyshWpo-PWg9gM.png" alt="description" style="width:300px;">
This README provides comprehensive details on the PangeaIns dataset, which was utilized during the instruction tuning phase for [Pangea-7B](https://huggingface.co/neulab/Pangea-7B).
## Description of PangeaIns
PangeaIns is a 6M multilingual multicultural multimodal instruction tuning dataset spanning 39 languages.
## PangeaIns Data Source
PangeaIns data path: PangeaIns.json (# samples: 6450624)
PangeaIns data source:
| Dataset Name | Dataset Path | # Samples |
|-----------------------------|--------------------------------------------------------------|-----------|
| ALLAVA-4V | general/ALLAVA-4V/data.json | 621327 |
| allava_vflan | general/allava_vflan/data.json | 325122 |
| Cambrian737k | general/cambrian/data.json | 736934 |
| ChartQA | doc+chart/ChartQA/data.json | 28299 |
| Code-Feedback | text-only/Code-Feedback/data.json | 20000 |
| doc-vqa | doc+chart/doc-vqa/data.json | 9665 |
| gpt4v-dataset | caption/gpt4v-dataset/data.json | 10822 |
| GQA-ru | general/GQA-ru/data.json | 40000 |
| laion-1M-qa | cultural/laion-multi-1M/captions-1M-generated-qas-llava.json | 1028791 |
| laion-300K-caption | cultural/laion-multi-1M/laion-300K-caption-llava.json | 300000 |
| llava-en-zh-300k | general/llava-en-zh-300k/data.json | 50000 |
| LLaVA-Finetune | cultural/laion-cultural-150k/laion-cultural-150k.json | 151072 |
| Llava-JP-Instruct-108K | general/LLaVA-JP-Instruct-108K/data.json | 108855 |
| llava-med-zh-instruct-60K | general/llava-med-zh-instruct-60k/data.json | 56649 |
| LLaVA-NeXt | general/LLaVA-NeXt-Data/data.json | 119853 |
| LVIS-Instruct4V | general/LVIS-Instruct4V/data.json | 222697 |
| MTVQA | general/MTVQA/data.json | 6678 |
| nvlr2-llava | general/nvlr2-llava/data.json | 86373 |
| NuminaMath-CoT | text-only/NuminaMath-CoT/data.json | 100000 |
| OpenHermes-2.5 | text-only/Openhermes-2.5/data.json | 399900 |
| palo_multilingual_dataset | general/palo_multilingual_dataset/urdu-100k.json | 99992 |
| ShareGPT-4o | general/ShareGPT-4o/data.json | 57289 |
| ShareGPT4V | general/ShareGPT4V/data.json | 91021 |
| STAIR-Captions | caption/STAIR-Captions/data.json | 82783 |
| table-vqa | doc+chart/table-vqa/data.json | 16408 |
| Viet-Doc-VQA | doc+chart/Viet-Doc-VQA/data.json | 12000 |
| Viet-DOC-VQA-II | doc+chart/Viet-DOC-VQA-II/data.json | 14998 |
| Viet-OCR-VQA | doc+chart/Viet-OCR-VQA/data.json | 30000 |
| Viet-ShareGPT-4o-Text-VQA | general/Viet-ShareGPT-4o-Text-VQA/data.json | 42678 |
| webui_multilingual_ocr | ocr/webui_multilingual_ocr/data.json | 300000 |
| translation | translation/data.json | 1280328 |
## Applications
PangeaIns was designed specifically for training the Pangea-7B model.
### Code Instructions
The dataset follows the LLaVA data format. To retrieve all files from PangeaIns, use the following script:
```python
from huggingface_hub import HfApi, hf_hub_download
import json
# Initialize the API client
api = HfApi()
dataset_name = "neulab/PangeaInstruct"
# Retrieve and download all files in the dataset
files = api.list_repo_files(repo_id=dataset_name, repo_type="dataset")
for file in files:
hf_hub_download(repo_id=dataset_name, filename=file, repo_type="dataset")
print(f"File downloaded: {file}")
# Load the complete PangeaIns dataset
with open('PangeaIns.json') as f:
data = json.load(f)
```
Please note that image data is provided in compressed formats such as `.tar` or `.zip`. After downloading, you may need to extract these files to access the images.
For images.tar files, you could untar them by running
```bash
tar -xvf images.tar
```
For images.zip files, you could unzip them by running
```bash
unzip images.zip
```
For some large tar files, we uploaded tar files splitted using the split command, such as `split -n 4 -d images.tar part_`.
For example, in the `cultural/laion-multi-1M` subset, we splitted the images.tar file into 4 parts, `part_00`, `part_01`, `part_02`, and `part_03`.
In such cases, you would need to first combine the splits and then extract the tar file.
```bash
cat part_* > images.tar
tar -xvf images.tar
```
Each subset within the PangeaIns dataset (e.g., ChartQA) contains a `.json` file for metadata and a corresponding `.tar/.zip` file for the images.
## Citing the Dataset
**BibTeX Citation:**
```
@article{yue2024pangeafullyopenmultilingual,
title={Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages},
author={Xiang Yue and Yueqi Song and Akari Asai and Seungone Kim and Jean de Dieu Nyandwi and Simran Khanuja and Anjali Kantharuban and Lintang Sutawika and Sathyanarayanan Ramamoorthy and Graham Neubig},
year={2024},
journal={arXiv preprint arXiv:2410.16153},
url={https://arxiv.org/abs/2410.16153}
}
```
## Contact
Corresponding to: {xyue2,yueqis,gneubig}@cs.cmu.edu
# PangeaInstruct
[Pangea:面向39种语言的全开源多语言多模态大语言模型(Multilingual Multimodal LLM)](https://neulab.github.io/Pangea/)
🇪🇹 🇸🇦 🇧🇬 🇧🇩 🇨🇿 🇩🇪 🇬🇷 🇬🇧 🇺🇸 🇪🇸 🇮🇷 🇫🇷 🇮🇪 🇮🇳 🇮🇩 🇳🇬 🇮🇹 🇮🇱 🇯🇵 🇮🇩 🇰🇷 🇳🇱 🇲🇳 🇲🇾 🇳🇴 🇵🇱 🇵🇹 🇧🇷 🇷🇴 🇷🇺 🇱🇰 🇮🇩 🇰🇪 🇹🇿 🇱🇰 🇮🇳 🇮🇳 🇹🇭 🇹🇷 🇺🇦 🇵🇰 🇮🇳 🇻🇳 🇨🇳 🇹🇼
[🏠 主页](https://neulab.github.io/Pangea/) | [🤖 Pangea-7B](https://huggingface.co/neulab/Pangea-7B) | [📊 PangeaIns](https://huggingface.co/datasets/neulab/PangeaInstruct) | [🧪 PangeaBench](https://huggingface.co/collections/neulab/pangea-6713c3b0d78a453906eb2ed8) | [💻 GitHub](https://github.com/neulab/Pangea/tree/main) | [📄 arXiv论文](https://arxiv.org/abs/2410.16153) | [📕 论文PDF](https://arxiv.org/pdf/2410.16153) | [🖥️ 在线演示](https://huggingface.co/spaces/neulab/Pangea)
<img src="https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/ZjVTKnIsyshWpo-PWg9gM.png" alt="description" style="width:300px;">
本README详细介绍了PangeaIns数据集,该数据集用于[Pangea-7B](https://huggingface.co/neulab/Pangea-7B)的指令微调阶段。
## PangeaIns数据集说明
PangeaIns是一个涵盖39种语言、规模达600万条的多语言多模态指令微调数据集。
## PangeaIns数据集来源
PangeaIns数据集路径:PangeaIns.json(样本量:6450624)
PangeaIns数据集来源:
| 数据集名称 | 数据集路径 | 样本数量 |
|-----------------------------|--------------------------------------------------------------|-----------|
| ALLAVA-4V | general/ALLAVA-4V/data.json | 621327 |
| allava_vflan | general/allava_vflan/data.json | 325122 |
| Cambrian737k | general/cambrian/data.json | 736934 |
| ChartQA | doc+chart/ChartQA/data.json | 28299 |
| Code-Feedback | text-only/Code-Feedback/data.json | 20000 |
| doc-vqa | doc+chart/doc-vqa/data.json | 9665 |
| gpt4v-dataset | caption/gpt4v-dataset/data.json | 10822 |
| GQA-ru | general/GQA-ru/data.json | 40000 |
| laion-1M-qa | cultural/laion-multi-1M/captions-1M-generated-qas-llava.json | 1028791 |
| laion-300K-caption | cultural/laion-multi-1M/laion-300K-caption-llava.json | 300000 |
| llava-en-zh-300k | general/llava-en-zh-300k/data.json | 50000 |
| LLaVA-Finetune | cultural/laion-cultural-150k/laion-cultural-150k.json | 151072 |
| Llava-JP-Instruct-108K | general/LLaVA-JP-Instruct-108K/data.json | 108855 |
| llava-med-zh-instruct-60K | general/llava-med-zh-instruct-60k/data.json | 56649 |
| LLaVA-NeXt | general/LLaVA-NeXt-Data/data.json | 119853 |
| LVIS-Instruct4V | general/LVIS-Instruct4V/data.json | 222697 |
| MTVQA | general/MTVQA/data.json | 6678 |
| nvlr2-llava | general/nvlr2-llava/data.json | 86373 |
| NuminaMath-CoT | text-only/NuminaMath-CoT/data.json | 100000 |
| OpenHermes-2.5 | text-only/Openhermes-2.5/data.json | 399900 |
| palo_multilingual_dataset | general/palo_multilingual_dataset/urdu-100k.json | 99992 |
| ShareGPT-4o | general/ShareGPT-4o/data.json | 57289 |
| ShareGPT4V | general/ShareGPT4V/data.json | 91021 |
| STAIR-Captions | caption/STAIR-Captions/data.json | 82783 |
| table-vqa | doc+chart/table-vqa/data.json | 16408 |
| Viet-Doc-VQA | doc+chart/Viet-Doc-VQA/data.json | 12000 |
| Viet-DOC-VQA-II | doc+chart/Viet-DOC-VQA-II/data.json | 14998 |
| Viet-OCR-VQA | doc+chart/Viet-OCR-VQA/data.json | 30000 |
| Viet-ShareGPT-4o-Text-VQA | general/Viet-ShareGPT-4o-Text-VQA/data.json | 42678 |
| webui_multilingual_ocr | ocr/webui_multilingual_ocr/data.json | 300000 |
| translation | translation/data.json | 1280328 |
## 应用场景
PangeaIns专为训练Pangea-7B模型设计。
### 代码指令说明
该数据集遵循LLaVA数据格式。如需获取PangeaIns的全部文件,请使用以下脚本:
python
from huggingface_hub import HfApi, hf_hub_download
import json
# 初始化API客户端
api = HfApi()
dataset_name = "neulab/PangeaInstruct"
# 获取并下载数据集中的所有文件
files = api.list_repo_files(repo_id=dataset_name, repo_type="dataset")
for file in files:
hf_hub_download(repo_id=dataset_name, filename=file, repo_type="dataset")
print(f"已下载文件:{file}")
# 加载完整的PangeaIns数据集
with open('PangeaIns.json') as f:
data = json.load(f)
请注意,图像数据以`.tar`或`.zip`等压缩格式提供。下载完成后,您可能需要解压这些文件才能访问其中的图像。
针对images.tar文件,可通过以下命令解压:
bash
tar -xvf images.tar
针对images.zip文件,可通过以下命令解压:
bash
unzip images.zip
对于部分大型tar文件,我们使用`split`命令将其拆分,例如`split -n 4 -d images.tar part_`。
例如在`cultural/laion-multi-1M`子集中,我们将images.tar文件拆分为4个部分:`part_00`、`part_01`、`part_02`和`part_03`。
此种情况下,您需要先合并拆分的文件,再解压tar包:
bash
cat part_* > images.tar
tar -xvf images.tar
PangeaIns数据集的每个子集(如ChartQA)均包含用于元数据的`.json`文件,以及对应存储图像的`.tar/.zip`文件。
## 数据集引用
**BibTeX引用格式:**
@article{yue2024pangeafullyopenmultilingual,
title={Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages},
author={Xiang Yue and Yueqi Song and Akari Asai and Seungone Kim and Jean de Dieu Nyandwi and Simran Khanuja and Anjali Kantharuban and Lintang Sutawika and Sathyanarayanan Ramamoorthy and Graham Neubig},
year={2024},
journal={arXiv preprint arXiv:2410.16153},
url={https://arxiv.org/abs/2410.16153}
}
## 联系方式
对应联系人:{xyue2,yueqis,gneubig}@cs.cmu.edu
提供机构:
maas
创建时间:
2024-10-24



