DanQing 100M Dataset
Source: ModelScope community | Updated: 2026-05-31 | Indexed: 2026-01-24
Download link:
https://modelscope.cn/datasets/deepglint/DanQing
<div align="center">
<img src="Figures/danqing.svg" width="30%">
**100M** Chinese image-text pairs | **12TB** dataset | **2024-2025** web data
<h1 align="center">DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset</h1>
</div>
<div align="center">
Hengyu Shen<sup>∗</sup>, [**Tiancheng Gu**](https://scholar.google.com/citations?hl=zh-CN&user=9etrpbYAAAAJ)<sup>∗</sup>, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, [**Zelong Sun**](https://scholar.google.com/citations?user=mDxuGMgAAAAJ&hl=zh-CN), Jun Wang, Nan Wu, [**Xiang An**](https://anxiangsir.github.io/), [**Weidong Cai**](https://weidong-tom-cai.github.io/), [**Ziyong Feng**](https://scholar.google.com/citations?user=xlKttUEAAAAJ&hl=zh-CN)<sup>‡</sup>, [**Kaicheng Yang**](https://kaicheng-yang0828.github.io)<sup>†</sup>
<sup>∗</sup> Equal Contribution | <sup>‡</sup> Team Leader | <sup>†</sup> Project Leader
[arXiv](https://arxiv.org/abs/2601.10305)
[Hugging Face](https://huggingface.co/datasets/DeepGlint-AI/DanQing100M)
[ModelScope](https://www.modelscope.cn/datasets/deepglint/DanQing)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
</div>
## 📣 News
<div align="left">
- [2026/01/16] ✨ We release the [paper](https://arxiv.org/abs/2601.10305) of DanQing.
- [2026/01/15] 🔥 We release the DanQing dataset (images and captions, about 12TB) on [ModelScope](https://www.modelscope.cn/datasets/deepglint/DanQing).
- [2026/01/13] ✨ We release the DanQing dataset (image URLs and captions) on [🤗 Hugging Face](https://huggingface.co/datasets/DeepGlint-AI/DanQing100M).
> ⚠️ **Note:** Due to the storage and transmission limitations of Hugging Face, we release only the image URLs there. To access the complete dataset, please download it from **ModelScope**. We also provide synthetic short captions (generated by GLM4.1-base-9B) for the DanQing100M dataset in the `recaption` column.
</div>
---
## 📑 Table of Contents
- [💡 Highlights](#-highlights)
- [💻 Dataset Information](#-dataset-information)
  - [Data Preview](#data-preview)
  - [Topic Assessment](#topic-assessment)
  - [Image Resolution and Text Length Distribution](#image-resolution-and-text-length-distribution)
  - [Text Quality](#text-quality)
  - [Cosine Similarity and Semantic Distribution](#cosine-similarity-and-semantic-distribution)
- [📊 Performance Comparison](#-performance-comparison)
  - [Zero-Shot Classification](#zero-shot-classification)
  - [Cross-Modal Retrieval (Short Caption)](#cross-modal-retrieval-short-caption)
  - [Cross-Modal Retrieval (Long Caption)](#cross-modal-retrieval-long-caption)
  - [Chinese-Centric Large Multimodal Model Tasks](#chinese-centric-large-multimodal-model-tasks)
- [🧠 Analysis](#-analysis)
  - [Data and Model Scaling](#data-and-model-scaling)
  - [New Concept Understanding](#new-concept-understanding)
- [📥 Download](#-download)
  - [🤗 Hugging Face](#-hugging-face)
    - [Python API](#python-api)
    - [Command Line](#command-line)
  - [ModelScope](#-modelscope)
    - [Python API](#python-api-1)
    - [Command Line](#command-line-1)
- [📄 License](#-license)
- [📝 Citation](#-citation)
---
## 💡 Highlights
We propose the **DanQing** dataset, which contains **100 million** image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, which yields higher data quality. Moreover, DanQing is built primarily from **2024–2025** web data, so models trained on it can better capture evolving semantic trends, which makes it more useful in practice.
We compare DanQing with existing datasets by conducting continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations.
<div align="center">
<img src="Figures/framework.png" width="100%">
</div>
---
## 💻 Dataset Information
### Data Preview
<div align="center">
<img src="Figures/case.png" width="100%">
</div>
### Topic Assessment
We implement a topic modeling pipeline based on [BERTopic](https://github.com/MaartenGr/BERTopic). We randomly sample 10M image-text pairs and extract text embeddings using [Chinese-CLIP-L/14](https://github.com/OFA-Sys/Chinese-CLIP). To address high-dimensional clustering, we apply UMAP for dimensionality reduction, followed by HDBSCAN to identify semantic clusters with a minimum cluster size of 1,000 for stability and noise reduction. Finally, we use class-based TF-IDF to extract representative keywords for each topic.
<div align="center">
<img src="Figures/topic_examples.png" width="100%">
</div>
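The last step of the pipeline, class-based TF-IDF, can be illustrated with a minimal pure-Python sketch. It assumes the captions are already tokenized and already carry cluster labels (for example from HDBSCAN, where `-1` marks noise). The function name `class_tfidf`, the toy captions, and the labels are hypothetical and are not taken from the DanQing pipeline. The score follows the formulation used by BERTopic: term frequency within a cluster multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the frequency of term `t` across all clusters.

```python
import math
from collections import Counter, defaultdict


def class_tfidf(docs, labels, top_k=3):
    """Return the top_k keywords per cluster using class-based TF-IDF.

    docs   : list of token lists (captions that are already segmented)
    labels : one cluster id per document; -1 marks noise and is skipped
    """
    # Merge all documents of a cluster into one bag of words.
    class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        if label == -1:
            continue
        class_counts[label].update(tokens)

    # f_t: frequency of each term across all clusters.
    total_freq = Counter()
    for counts in class_counts.values():
        total_freq.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_freq.values()) / len(class_counts)

    keywords = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total_freq[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties are broken by the term itself for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[label] = ranked[:top_k]
    return keywords


# Hypothetical toy captions (tokenized) with hypothetical cluster labels.
docs = [
    ["猫", "沙发", "图片"],
    ["猫", "玩具"],
    ["汽车", "发动机", "图片"],
    ["汽车", "轮胎"],
    ["广告"],
]
labels = [0, 0, 1, 1, -1]
topic_keywords = class_tfidf(docs, labels, top_k=2)
print(topic_keywords)
```

Terms shared by several clusters, such as "图片" here, receive a lower weight than terms concentrated in one cluster, which is why they drop out of the representative keywords.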
### Image Resolution and Text Length Distribution
We analyze image resolutions by width, height, and minimum dimension, demonstrating a wide range of visual scales. We also report the distribution of text lengths across **2.2B** Chinese words.
<div align="center">
<img src="Figures/statistic.png" width="100%">
</div>
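As a rough illustration of the minimum-dimension statistic, the sketch below buckets images by `min(width, height)`. The bucket edges and the sample sizes are hypothetical and are not the bins used in the figure above.

```python
def min_dimension_histogram(sizes, bin_edges=(256, 512, 1024)):
    """Count images per bucket of min(width, height).

    sizes     : iterable of (width, height) tuples in pixels
    bin_edges : ascending bucket boundaries (hypothetical defaults)
    """
    labels = (
        [f"<{bin_edges[0]}"]
        + [f"{lo}-{hi - 1}" for lo, hi in zip(bin_edges, bin_edges[1:])]
        + [f">={bin_edges[-1]}"]
    )
    histogram = {label: 0 for label in labels}
    for width, height in sizes:
        shortest = min(width, height)
        # The number of edges at or below the value is the bucket index.
        index = sum(shortest >= edge for edge in bin_edges)
        histogram[labels[index]] += 1
    return histogram


# Hypothetical image sizes.
histogram = min_dimension_histogram([(640, 480), (1920, 1080), (200, 300), (512, 512)])
print(histogram)
```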
### Text Quality
We evaluate the text quality of DanQing using two metrics: **semantic word density** and **perplexity (PPL)**. We randomly sample 10M texts from DanQing, Wukong, and Zero for comparison. Semantic words (nouns, verbs, adjectives) are identified using the jieba toolkit, and their proportion in each sentence is calculated as semantic density. Sentence-level perplexity is computed with a pre-trained Chinese [BERT](https://huggingface.co/google-bert/bert-base-chinese) model.
<div align="center">
<img src="Figures/quality.png" width="100%">
</div>
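The two metrics can be sketched in a few lines of Python. The sketch assumes that each sentence has already been segmented and part-of-speech tagged into `(word, flag)` pairs (the kind of output a tagger such as jieba's `posseg` module produces) and that per-token log-probabilities are available from a language model. Treating every flag that starts with `n`, `v`, or `a` as a semantic word is an assumption, as are the sample sentence and probabilities. The README does not state how token probabilities were obtained from BERT, so the perplexity function only shows the standard definition.

```python
import math

# Assumption: POS flags starting with n (noun), v (verb), or a (adjective)
# count as semantic words.
SEMANTIC_PREFIXES = ("n", "v", "a")


def semantic_density(tagged_words):
    """Share of nouns, verbs, and adjectives among (word, flag) pairs."""
    if not tagged_words:
        return 0.0
    semantic = sum(1 for _, flag in tagged_words if flag.startswith(SEMANTIC_PREFIXES))
    return semantic / len(tagged_words)


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability of the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical tagged sentence: "一只可爱的猫在睡觉"
tagged = [("一只", "m"), ("可爱", "a"), ("的", "uj"), ("猫", "n"), ("在", "p"), ("睡觉", "v")]
density = semantic_density(tagged)

# Hypothetical per-token probabilities from a language model.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(log_probs)

print(density, ppl)
```

A higher density means a larger share of content-bearing words, while a lower perplexity means the language model finds the sentence more fluent.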
### Cosine Similarity and Semantic Distribution
We compare the image-text similarity distributions of 10M-sample subsets of DanQing and Wukong, computed from features extracted with [FG-CLIP2-L/16@256](https://huggingface.co/qihoo360/fg-clip2-large). To compare semantic distributions, we cluster 10M images from each dataset into 10K groups using [FAISS](https://github.com/facebookresearch/faiss) and rank the clusters by sample count.
<div align="center">
<img src="Figures/distribution.png" width="100%">
</div>
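The two measurements behind this comparison are simple to state in code. The sketch below uses plain Python lists in place of real FG-CLIP2 features and takes cluster assignments as given instead of running FAISS. The embeddings, the assignments, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_clusters(assignments):
    """Return (cluster_id, sample_count) pairs, largest cluster first."""
    return Counter(assignments).most_common()


# Hypothetical image and text embeddings for one pair.
similarity = cosine_similarity([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])

# Hypothetical cluster id per image, as a clustering step would produce.
ranked = rank_clusters([2, 0, 2, 1, 2, 0])

print(similarity, ranked)
```

Plotting the ranked counts shows how evenly samples spread over clusters: a steep curve means a few concepts dominate, while a flatter one suggests broader semantic coverage.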
---
## 📊 Performance Comparison
### Zero-Shot Classification
<div align="center">
<img src="Figures/classification.png" width="80%">
</div>
### Cross-Modal Retrieval (Short Caption)
<div align="center">
<img src="Figures/short.png" width="100%">
</div>
### Cross-Modal Retrieval (Long Caption)
<div align="center">
<img src="Figures/long.png" width="100%">
</div>
### Chinese-Centric Large Multimodal Model Tasks
<div align="center">
<img src="Figures/LMM.png" width="80%">
</div>
---
## 🧠 Analysis
### Data and Model Scaling
We compare the data and model scaling capabilities of DanQing and Wukong, reporting average zero-shot classification and retrieval (long & short caption) performance in the figure below.
<div align="center">
<img src="Figures/scaling.png" width="100%">
</div>
### New Concept Understanding
We evaluate how well SigLIP2-L/16 models pre-trained on various Chinese datasets understand newly emerging concepts, and find that the model trained on DanQing consistently assigns the highest confidence to the correct pairs.
<div align="center">
<img src="Figures/new_concept.png" width="100%">
</div>
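The README does not specify how the confidence for a pair is computed. As a minimal sketch, SigLIP-style models score an image-text pair independently by passing the scaled cosine similarity plus a bias through a sigmoid. The function name, embeddings, scale, and bias below are hypothetical values chosen for illustration, not parameters of any released checkpoint.

```python
import math


def pair_confidence(image_emb, text_emb, logit_scale, logit_bias):
    """Sigmoid of (scale * cosine similarity + bias) for one image-text pair."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    cosine = dot / norm
    logit = logit_scale * cosine + logit_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical embeddings: one image, one matching caption, one unrelated caption.
image = [0.6, 0.8]
correct_text = [0.6, 0.8]
wrong_text = [0.8, -0.6]

# Hypothetical scale and bias.
conf_correct = pair_confidence(image, correct_text, logit_scale=10.0, logit_bias=-5.0)
conf_wrong = pair_confidence(image, wrong_text, logit_scale=10.0, logit_bias=-5.0)

print(conf_correct, conf_wrong)
```

Because each pair is scored on its own instead of through a softmax over candidates, confidences from different models can be compared directly on the same image-text pair.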
---
## 📥 Download
### 🤗 Hugging Face
#### Python API
```python
from datasets import load_dataset
ds = load_dataset("DeepGlint-AI/DanQing100M")
```
#### Command Line
```bash
# Install dependencies
# brew install git-xet # macOS
# git xet install
# sudo apt update # Ubuntu/Debian
# sudo apt install aria2
# Install git-lfs
# curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
# sudo apt-get install git-lfs
# git lfs install
# Download dataset URLs and captions
bash hfd.sh DeepGlint-AI/DanQing100M --dataset --tool aria2c -x 10
# Download images using img2dataset
# pip install img2dataset
# For better performance, it's highly recommended to set up a fast dns resolver
# See: https://github.com/rom1504/img2dataset#setting-up-a-high-performance-dns-resolver
img2dataset --url_list DanQing100M/data \
--input_format "parquet" \
--url_col "url" \
--caption_col "alt_text" \
--output_format webdataset \
--output_folder DanQing100M-webdataset \
--processes_count 16 \
--thread_count 32 \
--image_size 256 \
--resize_only_if_bigger=True \
--resize_mode="keep_ratio" \
--skip_reencode=True \
--save_additional_columns '["recaption"]' \
--enable_wandb False
```
### ModelScope
#### Python API
```python
from modelscope.msdatasets import MsDataset
ds = MsDataset.load('deepglint/DanQing')
```
#### Command Line
```bash
pip install modelscope
modelscope download --dataset deepglint/DanQing
```
---
## 📄 License
The DanQing dataset is licensed under the [CC-BY-4.0 License](https://creativecommons.org/licenses/by/4.0/). The full license text is in the [LICENSE.cc-by-4.0 file](./LICENSE.cc-by-4.0). The dataset is collected from Common Crawl web pages and may contain biased or sensitive content. Each collected item remains subject to the license of its original source. Users are solely responsible for ensuring that their research or applications comply with ethical and legal standards.
---
## 📝 Citation
If you find this repository useful, please use the following BibTeX entry for citation.
```bibtex
@misc{danqing,
title={DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset},
author={Hengyu Shen and Tiancheng Gu and Bin Qin and Lan Wu and Yuling Wu and Shuo Tan and Zelong Sun and Jun Wang and Nan Wu and Xiang An and Weidong Cai and Ziyong Feng and Kaicheng Yang},
year={2026},
eprint={2601.10305},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.10305},
}
```
---
<div align="center">
### ⭐ Don't forget to star this repository if you find it helpful!
</div>
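Provided by: maas
Created: 2026-01-06
Background overview
The DanQing 100M dataset is a large-scale Chinese vision-language pre-training dataset containing 100 million image-text pairs (12TB in total). The data was collected from 2024-2025 web content and rigorously filtered to improve quality. The dataset aims to help models better capture semantic trends and improve performance on Chinese downstream tasks.
The summary above was collected and generated by 遇见数据集.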



