Bunny-v1.1-data

Name: Bunny-v1.1-data
Creator: maas
Published: 2026-04-07 17:11:11
License: No description available

ModelScope community. Updated 2026-04-07; indexed 2024-06-22.

Download link:

https://modelscope.cn/datasets/BoyaWu10/Bunny-v1.1-data


Description:

# Bunny-v1.1 Dataset Card

📖 [Technical report](https://arxiv.org/abs/2402.11530) | 🏠 [Code](https://github.com/BAAI-DCAI/Bunny) | 🐰 [Demo](http://bunny.baai.ac.cn)

Bunny is a family of lightweight multimodal models.

Bunny-v1.1-data is the training dataset for both the Bunny-v1.1 and Bunny-v1.0 series, including [Bunny-v1.1-Llama-3-8B-V](https://huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V) and [Bunny-v1.1-4B](https://huggingface.co/BAAI/Bunny-v1_1-4B).

## Pretrain

We use a high-quality coreset of LAION-2B, with fewer duplicates and more informative samples, built by [this work](https://github.com/BAAI-DCAI/Dataset-Pruning/tree/main/LAION). We randomly sample 2 million image-text pairs from the coreset and convert them to the training format. The pretraining data and images are in the `pretrain` folder and are the same as those in Bunny-v1.0-data.

## Finetune

In Bunny-v1.0-data, we build Bunny-695K for finetuning by modifying [SVIT-mix-665K](https://arxiv.org/abs/2307.04087). We then combine it with LLaVA-665K and ALLaVA-Instruct-4V to obtain Bunny-LLaVA-1.4M, Bunny-ALLaVA-1.3M, and Bunny-LLaVA-ALLaVA-2M. The finetuning data are in the `finetune` folder.

## Usage

The images are split across multiple archive parts. After downloading all parts, run the following command to merge them into one archive:

```shell
cat images.tar.gz.part-* > images.tar.gz
```

Then unpack the archive with the following command:

```shell
tar -xvzf images.tar.gz
```

## License

The content of this project itself is licensed under the Apache License 2.0.
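The merge-and-extract steps described under Usage can also be wrapped in a small helper that checks the result before unpacking. The following is a minimal sketch, not part of the Bunny repository: the function name `merge_and_extract` and its two arguments (source and destination directories) are hypothetical, and it assumes the parts follow the `images.tar.gz.part-*` naming shown in the card and sort correctly under shell glob expansion, as the original `cat` command also assumes.

```shell
# Hypothetical helper (an assumption, not from the Bunny repo): merge the split
# parts in SRC_DIR into images.tar.gz, verify it, then extract into DEST_DIR.
merge_and_extract() {
  src_dir=$1
  dest_dir=$2
  archive="$src_dir/images.tar.gz"

  # Expand the glob into the positional parameters. Pathname expansion is
  # sorted, so parts are concatenated in the same order `cat part-*` would use.
  set -- "$src_dir"/images.tar.gz.part-*
  if [ ! -e "$1" ]; then
    echo "no images.tar.gz.part-* files found in $src_dir" >&2
    return 1
  fi

  # Concatenate all parts into a single archive.
  cat "$@" > "$archive" || return 1

  # Test the gzip stream before extracting; a truncated or corrupted
  # merge is usually caught here rather than halfway through unpacking.
  gzip -t "$archive" || return 1

  mkdir -p "$dest_dir" || return 1
  tar -xzf "$archive" -C "$dest_dir"
}
```

For example, `merge_and_extract ./pretrain ./pretrain/images` would merge the parts found in `./pretrain` and unpack them under `./pretrain/images`; both paths are illustrative. The integrity check costs one extra pass over the archive, which may be worth it for large multi-part downloads where a single incomplete part would otherwise surface only as a late `tar` error.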


Provider: maas

Created: 2024-06-21
