
Mono-InternVL-2B-Synthetic-Data

ModelScope community. Updated 2025-12-10; indexed 2025-03-15.
Download link:
https://modelscope.cn/datasets/OpenGVLab/Mono-InternVL-2B-Synthetic-Data
# Mono-InternVL-2B Synthetic Data

This dataset is used for training the S1.2 stage of Mono-InternVL-2B, as described in the paper [Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models](https://huggingface.co/papers/2507.12566).

- **Project Page:** [https://internvl.github.io/blog/2024-10-10-Mono-InternVL/](https://internvl.github.io/blog/2024-10-10-Mono-InternVL/)
- **Code:** [https://github.com/OpenGVLab/Mono-InternVL](https://github.com/OpenGVLab/Mono-InternVL)

## Dataset Description

### Purpose

This dataset is used for training the S1.2 stage of Mono-InternVL-2B.

### Data Source

We use the pre-trained InternVL-8B to produce short captions for 258 million images sampled from Laion-2B, Coyo-700M and SAM(en).

### Size

- Total records: 259,064,832
- Files: 3,072 JSONL files, each containing 84,331 records.

## Dataset Structure

### File Format

- Each file is in JSON Lines (`.jsonl`) format, with one JSON object per line.

## Sample Usage

You can download the dataset directly using `git lfs`:

```bash
git lfs install
git clone https://huggingface.co/datasets/OpenGVLab/Mono-InternVL-2B-Synthetic-Data
```

Alternatively, you can load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset. By default, all JSONL files will be loaded into a 'train' split.
dataset = load_dataset("OpenGVLab/Mono-InternVL-2B-Synthetic-Data")

# Print the dataset info
print(dataset)

# Access an example from the 'train' split
print(dataset["train"][0])
```

For custom dataset preparation for supervised finetuning, refer to the [Dataset Preparation section in the official GitHub repository](https://github.com/OpenGVLab/Mono-InternVL#dataset-preparation). An example JSONL entry for fine-tuning conversations looks like this:

```json
{
  "id": "000000120375",
  "image": "coco/train2017/000000120375.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat type of vehicle is driving down the street in the image?"
    },
    {
      "from": "gpt",
      "value": "A red sports utility vehicle (SUV) is driving down the street in the image."
    },
    {
      "from": "human",
      "value": "Is the street crowded with people?"
    },
    {
      "from": "gpt",
      "value": "Yes, the street is filled with a considerable number of people, which indicates that the area is busy."
    }
  ]
}
```
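Before training on custom fine-tuning data, it can help to check each record against the shape of the example entry above. The following is a minimal sketch, not part of the official tooling. The function name `validate_entry` is hypothetical, and the rules it enforces (alternating `human`/`gpt` turns, an `<image>` placeholder in the first human turn) are assumptions inferred from that single example rather than a documented schema.

```python
import json

# Keys observed in the example entry; treated here as required (an assumption).
REQUIRED_KEYS = {"id", "image", "conversations"}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one fine-tuning record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems

    turns = entry["conversations"]
    if not isinstance(turns, list) or not turns or len(turns) % 2 != 0:
        problems.append("expected a non-empty list with an even number of turns")
        return problems

    for i, turn in enumerate(turns):
        # Assumed convention: human speaks first, then strict alternation.
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected 'from' == {expected!r}")
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"turn {i}: 'value' must be a non-empty string")

    # Assumed convention: the image placeholder appears in the first human turn.
    if "<image>" not in str(turns[0].get("value", "")):
        problems.append("first human turn lacks the <image> placeholder")
    return problems


# One JSONL line, shortened from the example entry (raw string keeps the \n escape).
line = (
    r'{"id": "000000120375", "image": "coco/train2017/000000120375.jpg", '
    r'"conversations": ['
    r'{"from": "human", "value": "<image>\nIs the street crowded with people?"}, '
    r'{"from": "gpt", "value": "Yes, the street is filled with a considerable '
    r'number of people, which indicates that the area is busy."}]}'
)
print(validate_entry(json.loads(line)))  # prints []
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a large file in a single pass.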

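The size figures and the one-object-per-line layout described in the dataset card can be checked without loading all 259 million records into memory. The sketch below uses only the Python standard library. The helper names `iter_jsonl` and `iter_shards` are hypothetical, and because the card does not document the field names inside each record, the code yields plain dictionaries and makes no assumption about their keys.

```python
import json
from pathlib import Path
from typing import Iterator

# Figures quoted in the dataset card.
FILES = 3_072
RECORDS_PER_FILE = 84_331

# 3,072 files x 84,331 records per file reproduces the stated total.
assert FILES * RECORDS_PER_FILE == 259_064_832


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with path.open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def iter_shards(root: Path) -> Iterator[dict]:
    """Stream records from every .jsonl file under root, in sorted path order."""
    for shard in sorted(root.rglob("*.jsonl")):
        yield from iter_jsonl(shard)
```

Pointing `iter_shards` at the directory produced by the `git clone` command and wrapping it in `itertools.islice(..., 3)` shows the first few records, which is a quick way to discover the actual field names. If you prefer the `datasets` library, passing `streaming=True` to `load_dataset` similarly avoids downloading every file up front.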
Provider: maas
Created: 2025-03-13