Bunny-v1.0-data
Source: ModelScope community | Updated: 2026-04-18 | Indexed: 2024-05-15
Download link:
https://modelscope.cn/datasets/BoyaWu10/Bunny-v1.0-data
Resource description:
# Bunny-v1.0 Dataset Card
📖 [Technical report](https://arxiv.org/abs/2402.11530) | 🤖 [Bunny-v1.0-3B](https://www.modelscope.cn/models/BAAI/Bunny-v1.0-3B) | 🏠 [Code](https://github.com/BAAI-DCAI/Bunny) | 🐰 [Demo](http://bunny.baai.ac.cn)
Bunny is a family of lightweight multimodal models.
Bunny-v1.0-data is the training dataset for the Bunny-v1.0 series, including [Bunny-v1.0-3B](https://www.modelscope.cn/models/BAAI/Bunny-v1.0-3B/summary).
## Pretrain
We use a high-quality coreset of LAION-2B, with fewer duplicates and more informative samples, built by [this work](https://github.com/BAAI-DCAI/Dataset-Pruning/tree/main/LAION).
We randomly sample 2 million image-text pairs from the coreset and convert them to the training format.
The pretraining data and images can be found in the `pretrain` folder.
## Finetune
We build Bunny-695K by modifying [SVIT-mix-665K](https://arxiv.org/abs/2307.04087) for finetuning.
The finetuning data can be found in the `finetune` folder.
## Usage
The images are split into multiple archive parts.
After downloading all of the parts, run the following command to merge them into a single archive:
```shell
cat images.tar.gz.part-* > images.tar.gz
```
Then unpack the archive with the following command:
```shell
tar -xvzf images.tar.gz
```
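The two steps can also be wrapped in one function that stops on the first error. This is only a minimal sketch: `merge_and_extract` and its two arguments are hypothetical names introduced here, not part of the Bunny repository. It assumes the part files sit in one directory and that their suffixes sort in the correct order, which the `cat` glob relies on as well.

```shell
#!/bin/sh
# Sketch: merge split archive parts, check the result, then extract it.
# merge_and_extract is a hypothetical helper, not part of the Bunny repository.
merge_and_extract() {
    src_dir=$1    # directory holding the images.tar.gz.part-* files
    dest_dir=$2   # directory to extract the images into

    # Fail early if no part files are present (an unmatched glob stays literal).
    set -- "$src_dir"/images.tar.gz.part-*
    if [ ! -e "$1" ]; then
        echo "no images.tar.gz.part-* files found in $src_dir" >&2
        return 1
    fi

    # The shell expands the glob in sorted order, so the parts are
    # concatenated in sequence.
    cat "$@" > "$src_dir/images.tar.gz" || return 1

    # Test gzip integrity before unpacking; a missing or truncated part
    # will normally be reported here.
    gzip -t "$src_dir/images.tar.gz" || return 1

    mkdir -p "$dest_dir" || return 1
    tar -xzf "$src_dir/images.tar.gz" -C "$dest_dir"
}
```

For example, `merge_and_extract . ./images` merges the parts in the current directory and extracts them into `./images`. Running `gzip -t` first costs one extra pass over the archive, but it avoids leaving a half-extracted image folder when a download was incomplete.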
## License
The content of this project itself is licensed under the Apache License 2.0.
Provider:
maas
Created:
2024-03-05



