DOCCI-CN
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/360zhinao/DOCCI-CN
Resource description:
# FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model
Code: https://github.com/360CVGroup/FG-CLIP
FG-CLIP 2 is a foundation model for fine-grained vision-language understanding in both English and Chinese.
Across 29 datasets and 8 diverse tasks, it consistently surpasses recent strong baselines such as SigLIP 2 and MetaCLIP 2, achieving the best reported performance to date in both languages.
**[FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model](https://arxiv.org/abs/2510.10921)**
<br>
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
<br>
[arXiv](https://arxiv.org/abs/2510.10921)
[Hugging Face collection](https://huggingface.co/collections/qihoo360/fg-clip-2-68ecbf9c548623bb78bc7913)
[research.360.cn](https://research.360.cn/sass/index)
**[FG-CLIP: Fine-Grained Visual and Textual Alignment](https://arxiv.org/abs/2505.05071)** ([code branch: v1.0](https://github.com/360CVGroup/FG-CLIP/tree/v1.0))
<br>
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
<br>
[arXiv](https://arxiv.org/abs/2505.05071)
[ICML 2025](https://icml.cc/Conferences/2025)
[Hugging Face collection](https://huggingface.co/collections/qihoo360/fg-clip-681da45d4acfb65c240a6d08)
[FineHARD dataset](https://huggingface.co/datasets/qihoo360/FineHARD)
[DeepWiki](https://deepwiki.com/360CVGroup/FG-CLIP)
## Data Preparation
To run the inference code for FG-CLIP 2, follow the steps below.
### Step 1: Download the model
#### Model Zoo
|Models | ViT | Model Weights | Demo |
|:-----------|:-----------------------:|:---------------------------------------------------------:|:--------------------------------------------------------:|
| FG-CLIP-Base | vit-base-patch16-224 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip-base) | [Retrieval](https://huggingface.co/spaces/qihoo360/FG-CLIP-Retrieval-demo) & [Dense Feature](https://huggingface.co/spaces/qihoo360/FG-CLIP-Densefeature-demo) |
| FG-CLIP-Large | vit-large-patch14-336 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip-large) | |
| FG-CLIP2-Base | vit-base-patch16 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip2-base) | [Retrieval](https://huggingface.co/spaces/qihoo360/FG-CLIP2-Retrieval-demo) & [Dense Feature](https://huggingface.co/spaces/qihoo360/FG-CLIP2-Densefeature-demo) |
| FG-CLIP2-Large | vit-large-patch16 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip2-large) | |
| FG-CLIP2-So400m | vit-so400m-patch16 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip2-so400m) | |
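One way to fetch a checkpoint from the table above is through the `huggingface_hub` package. The sketch below is illustrative only: the `./checkpoints` target directory and the `download_model` helper are assumptions, not part of the FG-CLIP repository, and the repository ids are simply copied from the Model Weights column.

```python
from pathlib import Path

# Repository ids copied from the Model Zoo table.
MODEL_REPOS = {
    "FG-CLIP-Base": "qihoo360/fg-clip-base",
    "FG-CLIP-Large": "qihoo360/fg-clip-large",
    "FG-CLIP2-Base": "qihoo360/fg-clip2-base",
    "FG-CLIP2-Large": "qihoo360/fg-clip2-large",
    "FG-CLIP2-So400m": "qihoo360/fg-clip2-so400m",
}


def download_model(name: str, target_dir: str = "./checkpoints") -> Path:
    """Download one checkpoint and return the local directory (hypothetical helper)."""
    if name not in MODEL_REPOS:
        raise KeyError(f"unknown model {name!r}; choose from {sorted(MODEL_REPOS)}")
    # Imported lazily so the lookup table is usable without the dependency installed.
    from huggingface_hub import snapshot_download

    local_dir = Path(target_dir) / name
    return Path(snapshot_download(repo_id=MODEL_REPOS[name], local_dir=local_dir))
```

Calling `download_model("FG-CLIP2-Base")` would place the files under `./checkpoints/FG-CLIP2-Base`; it requires network access and `pip install huggingface_hub`.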
### Step 2: Prepare DOCCI-CN Dataset
First, download the dataset from [🤗DOCCI-CN](https://huggingface.co/datasets/qihoo360/DOCCI-CN). After downloading, extract all compressed files to obtain the following file structure:
```none
DOCCI-CN
├── txtfile
│   └── image_caption.txt
└── images
    ├── test_00000.jpg
    ├── test_00001.jpg
    ├── ...
    └── test_04999.jpg
```
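After extraction, a quick sanity check can confirm that the directory matches the layout shown above. The following is a minimal sketch under two assumptions: that the images are numbered contiguously from `test_00000.jpg` to `test_04999.jpg` (5,000 files, inferred from the listing), and that `missing_files` is a hypothetical helper name. It does not parse `image_caption.txt`, whose format is not described here.

```python
from pathlib import Path


def missing_files(root: str, num_images: int = 5000) -> list[str]:
    """Return the expected paths (relative to root) that are absent on disk."""
    root_path = Path(root)
    expected = [root_path / "txtfile" / "image_caption.txt"]
    # Assumption: image names are zero-padded to five digits and contiguous.
    expected += [
        root_path / "images" / f"test_{i:05d}.jpg" for i in range(num_images)
    ]
    return [str(p.relative_to(root_path)) for p in expected if not p.is_file()]
```

An empty list from `missing_files("DOCCI-CN")` means every expected file was found; otherwise the returned paths show what to re-download or re-extract.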
## Benchmarks
I2T and T2I denote image-to-text and text-to-image retrieval, respectively.

|Model|Backbone|I2T|T2I|
| ---- | ---- |---- |---- |
|R2D2|ViT-B/16|36.1|36.9|
|Chinese-CLIP|ViT-B/16|44.6|43.1|
|SigLIP 2|ViT-B/16|7.6|5.7|
|**FG-CLIP 2 (ours)**|ViT-B/16|**71.2**|**75.4**|
|R2D2|ViT-L/14|49.5|46.3|
|Chinese-CLIP|ViT-L/14|49.7|50.8|
|SigLIP 2|ViT-L/16|24.6|27.3|
|**FG-CLIP 2 (ours)**|ViT-L/16|**77.6**|**81.9**|
|SigLIP 2|ViT-So/16|25.0|21.3|
|MetaCLIP 2|ViT-H/14|73.8|77.2|
|**FG-CLIP 2 (ours)**|ViT-So/16|**79.7**|**84.0**|
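The table does not state which retrieval metric is reported. As an illustration only, the sketch below assumes Recall@1 over a one-to-one pairing in which image `i` matches caption `i`, and computes both directions from a precomputed similarity matrix `sim[i][j]` (image `i` against caption `j`). The function names are hypothetical, and this is not the authors' evaluation script.

```python
from typing import Sequence


def recall_at_1(sim: Sequence[Sequence[float]]) -> float:
    """Percentage of rows whose highest-scoring column is the row's own index."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += best == i
    return 100.0 * hits / len(sim)


def retrieval_scores(sim: Sequence[Sequence[float]]) -> dict[str, float]:
    """I2T ranks captions for each image; T2I ranks images for each caption."""
    transposed = [list(col) for col in zip(*sim)]
    return {"I2T": recall_at_1(sim), "T2I": recall_at_1(transposed)}
```

For example, with `sim = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.1, 0.3]]`, the third image ranks the first caption highest, so I2T is 2 out of 3 while every caption still retrieves its own image for T2I.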
## Citation
If you find DOCCI-CN useful for your research and applications, please cite using this BibTeX:
```
@article{xie2025fg2,
title={FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Ao, Ji and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2510.10921},
year={2025}
}
```
```
@article{xie2025fg,
title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2505.05071},
year={2025}
}
```
## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the [Apache License 2.0](./LICENSE).
Provider: maas
Created: 2025-10-16



