Open-Qwen2VL-Data
Source: ModelScope community. Updated 2026-01-09; indexed 2025-05-10.
Download link:
https://modelscope.cn/datasets/swift/Open-Qwen2VL-Data
## Introduction
This repository contains the data for [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595).
Project page: https://victorwz.github.io/Open-Qwen2VL
Code: https://github.com/Victorwz/Open-Qwen2VL
## Dataset
- ccs_ebdataset: CC3M-CC12M-SBU filtered by CLIP. We directly download the webdataset from the [curated subset released with BLIP-1](https://github.com/salesforce/BLIP).
- datacomp_medium_dfn_webdataset: DataComp-Medium-128M filtered by DFN. We select this subset based on the uids released by DFN.
- datacomp_medium_mlm_filter_su_85_union_dfn_webdataset: the union of DataComp-Medium-128M filtered by DFN and DataComp-Medium-128M filtered by MLM-Filter, using the semantic understanding metric with a threshold of 85.
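
The union used for the third subset can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `select_union`, the toy uids, the score dictionary, and the choice of an inclusive comparison (`>=`) against the threshold are all assumptions made for illustration.

```python
def select_union(dfn_uids, mlm_scores, threshold=85):
    """Return uids kept by DFN or kept by the MLM-Filter score threshold.

    dfn_uids:   iterable of uids released by DFN (toy values here).
    mlm_scores: mapping uid -> semantic understanding score (hypothetical).
    threshold:  assumed inclusive cutoff; the source only states "threshold 85".
    """
    mlm_kept = {uid for uid, score in mlm_scores.items() if score >= threshold}
    return set(dfn_uids) | mlm_kept


# Toy data, not real DataComp uids or scores.
dfn_uids = {"a1", "b2", "c3"}
mlm_scores = {"b2": 90, "d4": 85, "e5": 40}

union = select_union(dfn_uids, mlm_scores)
print(sorted(union))
```

A sample appears in the union if either filter keeps it, so the result is at least as large as each individual filtered subset.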
## Acknowledgement
This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487.
## Citation
```bibtex
@article{Open-Qwen2VL,
title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
journal={arXiv preprint arXiv:2504.00595},
year={2025}
}
```
Provider: maas
Created: 2025-04-30



