
Open-Qwen2VL-Data

ModelScope community. Updated 2026-01-09; indexed 2025-05-10.
Download link:
https://modelscope.cn/datasets/swift/Open-Qwen2VL-Data
Resource description:
## Introduction

This repository contains the data for [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595).

Project page: https://victorwz.github.io/Open-Qwen2VL

Code: https://github.com/Victorwz/Open-Qwen2VL

## Dataset

- ccs_ebdataset: CC3M-CC12M-SBU filtered by CLIP. We directly download the webdataset based on the [curated subset released with BLIP-1](https://github.com/salesforce/BLIP).
- datacomp_medium_dfn_webdataset: DataComp-Medium-128M filtered by DFN. We select this subset based on the UIDs released by DFN.
- datacomp_medium_mlm_filter_su_85_union_dfn_webdataset: the union of DataComp-Medium-128M filtered by DFN and DataComp-Medium-128M filtered by MLM-Filter on the semantic understanding metric with a threshold of 85.

## Acknowledgement

This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487.

## Citation

```bibtex
@article{Open-Qwen2VL,
  title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
  author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
  journal={arXiv preprint arXiv:2504.00595},
  year={2025}
}
```
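As an illustration of the union described in the Dataset section for datacomp_medium_mlm_filter_su_85_union_dfn_webdataset, the selection logic can be sketched in a few lines of Python. This is a minimal sketch and not the authors' pipeline: the function name `select_union`, the in-memory inputs, and the toy UIDs are hypothetical, and treating the threshold of 85 as inclusive (`>=`) is an assumption, since the description does not say whether the cut-off is inclusive.

```python
from typing import Dict, Iterable, Set


def select_union(
    dfn_uids: Iterable[str],
    su_scores: Dict[str, float],
    threshold: float = 85,
) -> Set[str]:
    """Return the UIDs kept by either filter.

    dfn_uids  : UIDs released as the DFN-filtered subset (hypothetical input).
    su_scores : semantic-understanding score per UID (hypothetical input).
    threshold : score cut-off; inclusive comparison is an assumption.
    """
    # Samples that pass the score-based filter.
    mlm_kept = {uid for uid, score in su_scores.items() if score >= threshold}
    # A sample is kept if it appears in either subset.
    return set(dfn_uids) | mlm_kept


if __name__ == "__main__":
    # Toy data, purely illustrative.
    dfn_uids = ["a", "b"]
    su_scores = {"b": 90, "c": 85, "d": 40}
    print(sorted(select_union(dfn_uids, su_scores)))  # ['a', 'b', 'c']
```

Taking a union rather than an intersection means a sample survives if either filter accepts it, so the combined subset is at least as large as each individual one.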

Provider: maas
Created: 2025-04-30