Download link:

https://modelscope.cn/datasets/AI-ModelScope/PangeaInstruct

Resource description:

# PangeaInstruct

[Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages](https://neulab.github.io/Pangea/)

🇪🇹 🇸🇦 🇧🇬 🇧🇩 🇨🇿 🇩🇪 🇬🇷 🇬🇧 🇺🇸 🇪🇸 🇮🇷 🇫🇷 🇮🇪 🇮🇳 🇮🇩 🇳🇬 🇮🇹 🇮🇱 🇯🇵 🇮🇩 🇰🇷 🇳🇱 🇲🇳 🇲🇾 🇳🇴 🇵🇱 🇵🇹 🇧🇷 🇷🇴 🇷🇺 🇱🇰 🇮🇩 🇰🇪 🇹🇿 🇱🇰 🇮🇳 🇮🇳 🇹🇭 🇹🇷 🇺🇦 🇵🇰 🇮🇳 🇻🇳 🇨🇳 🇹🇼

[🏠 Homepage](https://neulab.github.io/Pangea/) | [🤖 Pangea-7B](https://huggingface.co/neulab/Pangea-7B) | [📊 PangeaIns](https://huggingface.co/datasets/neulab/PangeaInstruct) | [🧪 PangeaBench](https://huggingface.co/collections/neulab/pangea-6713c3b0d78a453906eb2ed8) | [💻 Github](https://github.com/neulab/Pangea/tree/main) | [📄 Arxiv](https://arxiv.org/abs/2410.16153) | [📕 PDF](https://arxiv.org/pdf/2410.16153) | [🖥️ Demo](https://huggingface.co/spaces/neulab/Pangea)

<img src="https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/ZjVTKnIsyshWpo-PWg9gM.png" alt="description" style="width:300px;">

This README provides comprehensive details on the PangeaIns dataset, which was utilized during the instruction tuning phase for [Pangea-7B](https://huggingface.co/neulab/Pangea-7B).

## Description of PangeaIns

PangeaIns is a 6M multilingual multicultural multimodal instruction tuning dataset spanning 39 languages.
## PangeaIns Data Source

PangeaIns data path: PangeaIns.json (# samples: 6450624)

PangeaIns data source:

| Dataset Name                | Dataset Path                                                 | # Samples |
|-----------------------------|--------------------------------------------------------------|-----------|
| ALLAVA-4V                   | general/ALLAVA-4V/data.json                                  | 621327    |
| allava_vflan                | general/allava_vflan/data.json                               | 325122    |
| Cambrian737k                | general/cambrian/data.json                                   | 736934    |
| ChartQA                     | doc+chart/ChartQA/data.json                                  | 28299     |
| Code-Feedback               | text-only/Code-Feedback/data.json                            | 20000     |
| doc-vqa                     | doc+chart/doc-vqa/data.json                                  | 9665      |
| gpt4v-dataset               | caption/gpt4v-dataset/data.json                              | 10822     |
| GQA-ru                      | general/GQA-ru/data.json                                     | 40000     |
| laion-1M-qa                 | cultural/laion-multi-1M/captions-1M-generated-qas-llava.json | 1028791   |
| laion-300K-caption          | cultural/laion-multi-1M/laion-300K-caption-llava.json        | 300000    |
| llava-en-zh-300k            | general/llava-en-zh-300k/data.json                           | 50000     |
| LLaVA-Finetune              | cultural/laion-cultural-150k/laion-cultural-150k.json        | 151072    |
| Llava-JP-Instruct-108K      | general/LLaVA-JP-Instruct-108K/data.json                     | 108855    |
| llava-med-zh-instruct-60K   | general/llava-med-zh-instruct-60k/data.json                  | 56649     |
| LLaVA-NeXt                  | general/LLaVA-NeXt-Data/data.json                            | 119853    |
| LVIS-Instruct4V             | general/LVIS-Instruct4V/data.json                            | 222697    |
| MTVQA                       | general/MTVQA/data.json                                      | 6678      |
| nvlr2-llava                 | general/nvlr2-llava/data.json                                | 86373     |
| NuminaMath-CoT              | text-only/NuminaMath-CoT/data.json                           | 100000    |
| OpenHermes-2.5              | text-only/Openhermes-2.5/data.json                           | 399900    |
| palo_multilingual_dataset   | general/palo_multilingual_dataset/urdu-100k.json             | 99992     |
| ShareGPT-4o                 | general/ShareGPT-4o/data.json                                | 57289     |
| ShareGPT4V                  | general/ShareGPT4V/data.json                                 | 91021     |
| STAIR-Captions              | caption/STAIR-Captions/data.json                             | 82783     |
| table-vqa                   | doc+chart/table-vqa/data.json                                | 16408     |
| Viet-Doc-VQA                | doc+chart/Viet-Doc-VQA/data.json                             | 12000     |
| Viet-DOC-VQA-II             | doc+chart/Viet-DOC-VQA-II/data.json                          | 14998     |
| Viet-OCR-VQA                | doc+chart/Viet-OCR-VQA/data.json                             | 30000     |
| Viet-ShareGPT-4o-Text-VQA   | general/Viet-ShareGPT-4o-Text-VQA/data.json                  | 42678     |
| webui_multilingual_ocr      | ocr/webui_multilingual_ocr/data.json                         | 300000    |
| translation                 | translation/data.json                                        | 1280328   |

## Applications

PangeaIns was designed specifically for training the Pangea-7B model.

### Code Instructions

The dataset follows the LLaVA data format. To retrieve all files from PangeaIns, use the following script:

```python
import json

from huggingface_hub import HfApi, hf_hub_download

# Initialize the API client
api = HfApi()
dataset_name = "neulab/PangeaInstruct"

# Retrieve and download all files in the dataset
files = api.list_repo_files(repo_id=dataset_name, repo_type="dataset")

for file in files:
    # local_dir="." mirrors the repository layout into the current directory,
    # so the relative path used below resolves (the default is the Hub cache).
    hf_hub_download(
        repo_id=dataset_name,
        filename=file,
        repo_type="dataset",
        local_dir=".",
    )
    print(f"File downloaded: {file}")

# Load the complete PangeaIns dataset
with open("PangeaIns.json", encoding="utf-8") as f:
    data = json.load(f)
```

Please note that image data is provided in compressed formats such as `.tar` or `.zip`. After downloading, you may need to extract these files to access the images.

For `images.tar` files, you can extract them by running

```bash
tar -xvf images.tar
```

For `images.zip` files, you can unzip them by running

```bash
unzip images.zip
```

For some large tar files, we uploaded parts produced with the `split` command, such as `split -n 4 -d images.tar part_`. For example, in the `cultural/laion-multi-1M` subset, we split the `images.tar` file into 4 parts: `part_00`, `part_01`, `part_02`, and `part_03`. In such cases, you need to first combine the parts and then extract the tar file.

```bash
cat part_* > images.tar
tar -xvf images.tar
```

Each subset within the PangeaIns dataset (e.g., ChartQA) contains a `.json` file for metadata and a corresponding `.tar/.zip` file for the images.
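The combine-and-extract steps above can also be done from Python, which may be convenient on systems without `cat`, `tar`, or `unzip`. The following is a minimal sketch using only the standard library. The helper names `join_parts` and `extract_images` are hypothetical (they are not part of the Pangea codebase), and the sketch assumes each subset directory holds `images.tar`, `images.zip`, or `part_*` pieces as described above.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path


def join_parts(subset_dir: Path, prefix: str = "part_", target: str = "images.tar") -> Path:
    """Concatenate part_00, part_01, ... into one file, like `cat part_* > images.tar`."""
    # Lexicographic order matches the numeric suffixes written by `split -d`.
    parts = sorted(subset_dir.glob(f"{prefix}*"))
    if not parts:
        raise FileNotFoundError(f"no {prefix}* files found in {subset_dir}")
    out_path = subset_dir / target
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path


def extract_images(subset_dir: Path) -> None:
    """Extract images.tar or images.zip in subset_dir, joining split parts first if needed."""
    tar_path = subset_dir / "images.tar"
    zip_path = subset_dir / "images.zip"
    # Rebuild the tar only when it is missing and split parts are present.
    if not tar_path.exists() and any(subset_dir.glob("part_*")):
        tar_path = join_parts(subset_dir)
    if tar_path.exists():
        with tarfile.open(tar_path) as tar:
            tar.extractall(subset_dir)
    elif zip_path.exists():
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(subset_dir)
```

For instance, after downloading the files into the current directory, `extract_images(Path("cultural/laion-multi-1M"))` would rebuild `images.tar` from the four parts and unpack it in place. Only extract archives from sources you trust, since archive members determine the paths that get written.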
## Citing the Dataset

**BibTeX Citation:**

```
@article{yue2024pangeafullyopenmultilingual,
  title={Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages},
  author={Xiang Yue and Yueqi Song and Akari Asai and Seungone Kim and Jean de Dieu Nyandwi and Simran Khanuja and Anjali Kantharuban and Lintang Sutawika and Sathyanarayanan Ramamoorthy and Graham Neubig},
  year={2024},
  journal={arXiv preprint arXiv:2410.16153},
  url={https://arxiv.org/abs/2410.16153}
}
```

## Contact

Correspondence to: {xyue2,yueqis,gneubig}@cs.cmu.edu

