LIT-CN
Source: ModelScope community · Updated 2025-12-18 · Indexed 2025-11-03
Download link: https://modelscope.cn/datasets/360zhinao/LIT-CN
# FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model
Code: https://github.com/360CVGroup/FG-CLIP
FG-CLIP 2 is a foundation model for fine-grained vision-language understanding in both English and Chinese.
Across 29 datasets and 8 diverse tasks, it consistently surpasses recent strong baselines such as SigLIP 2 and MetaCLIP 2, achieving the best reported performance to date in both languages.
**[FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model](https://arxiv.org/abs/2510.10921)**
<br>
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
<br>
[arXiv](https://arxiv.org/abs/2510.10921) · [Hugging Face collection](https://huggingface.co/collections/qihoo360/fg-clip-2-68ecbf9c548623bb78bc7913) · [research.360.cn](https://research.360.cn/sass/index)
**[FG-CLIP: Fine-Grained Visual and Textual Alignment](https://arxiv.org/abs/2505.05071)** ([code branch: v1.0](https://github.com/360CVGroup/FG-CLIP/tree/v1.0))
<br>
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
<br>
[arXiv](https://arxiv.org/abs/2505.05071) · [ICML 2025](https://icml.cc/Conferences/2025) · [Hugging Face collection](https://huggingface.co/collections/qihoo360/fg-clip-681da45d4acfb65c240a6d08) · [FineHARD dataset](https://huggingface.co/datasets/qihoo360/FineHARD) · [DeepWiki](https://deepwiki.com/360CVGroup/FG-CLIP)
## Data Preparation
To run the inference code for FG-CLIP 2, follow the steps below.
### Step 1: Download the model
#### Model Zoo
|Models | ViT | Model Weights | Demo |
|:-----------|:-----------------------:|:---------------------------------------------------------:|:--------------------------------------------------------:|
| FG-CLIP-Base | vit-base-patch16-224 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip-base) | [Retrieval](https://huggingface.co/spaces/qihoo360/FG-CLIP-Retrieval-demo) & [Dense Feature](https://huggingface.co/spaces/qihoo360/FG-CLIP-Densefeature-demo) |
| FG-CLIP-Large | vit-large-patch14-336 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip-large) | |
| FG-CLIP2-Base | vit-base-patch16 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip2-base) | [Retrieval](https://huggingface.co/spaces/qihoo360/FG-CLIP2-Retrieval-demo) & [Dense Feature](https://huggingface.co/spaces/qihoo360/FG-CLIP2-Densefeature-demo) |
| FG-CLIP2-Large | vit-large-patch16 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip2-large) | |
| FG-CLIP2-So400m | vit-so400m-patch16 | [🤗Huggingface](https://huggingface.co/qihoo360/fg-clip2-so400m) | |
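The checkpoints in the table can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper name `download_model` and the `checkpoints` target directory are illustrative choices, not part of the FG-CLIP repository.

```python
from pathlib import Path

# Repository IDs copied from the Model Zoo table above.
MODEL_REPOS = {
    "FG-CLIP-Base": "qihoo360/fg-clip-base",
    "FG-CLIP-Large": "qihoo360/fg-clip-large",
    "FG-CLIP2-Base": "qihoo360/fg-clip2-base",
    "FG-CLIP2-Large": "qihoo360/fg-clip2-large",
    "FG-CLIP2-So400m": "qihoo360/fg-clip2-so400m",
}


def download_model(name: str, root: str = "checkpoints") -> Path:
    """Download one checkpoint into <root>/<repo name> and return the local path.

    Hypothetical helper: the name and default directory are assumptions.
    """
    repo_id = MODEL_REPOS[name]  # raises KeyError for names not in the table
    # Imported lazily so the table lookup works without the package installed.
    from huggingface_hub import snapshot_download

    target = Path(root) / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_model("FG-CLIP2-Base")` would place the files under `checkpoints/fg-clip2-base`; a network connection to the Hugging Face Hub is required.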
### Step 2: Prepare LIT-CN Dataset
First, download the dataset from [🤗LIT-CN](https://huggingface.co/datasets/qihoo360/LIT-CN). After downloading, unzip all compressed files to obtain the following file structure:
```none
LIT-CN
├── txtfile
│   └── image_caption.txt
└── images
    ├── AIGC
    │   ├── t010004b0bada0f11a4.jpg
    │   ├── t010004c6d4819ee63e.jpg
    │   ├── ...
    │   └── t01fff7e28dcfbb930f.jpg
    ├── AIchallenge
    │   ├── 0001cd25094a2a1bcc22a7a37bb73c9077863f76.jpg
    │   ├── 00086160dec706f5ca3065177435f316ede91bc9.jpg
    │   ├── ...
    │   └── fffd354d8e0cc465ff59db3419209fd691a7d45c.jpg
    └── muge
        ├── 0003d729377690c087e35fa2f7eef01a.jpg
        ├── 00120afd821d98df982a3afde89c593c.jpg
        ├── ...
        └── ffd98c46b1a258cae1f118bc47477528.jpg
```
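After unzipping, it can be useful to confirm that the layout matches the tree above before running inference. The sketch below checks only what the tree shows (the caption file and the three image folders) and counts `.jpg` files per folder. The function name `summarize_lit_cn` is hypothetical, and the internal format of `image_caption.txt` is not described here, so the file is not parsed.

```python
from pathlib import Path

# Image subfolders listed in the file structure above.
EXPECTED_SUBSETS = ("AIGC", "AIchallenge", "muge")


def summarize_lit_cn(root: str) -> dict:
    """Return the number of .jpg files per image subset.

    Raises FileNotFoundError if the caption file or a subset folder is missing.
    """
    base = Path(root)
    caption_file = base / "txtfile" / "image_caption.txt"
    if not caption_file.is_file():
        raise FileNotFoundError(f"missing caption file: {caption_file}")

    counts = {}
    for subset in EXPECTED_SUBSETS:
        folder = base / "images" / subset
        if not folder.is_dir():
            raise FileNotFoundError(f"missing image folder: {folder}")
        # Count only files directly inside the subset folder.
        counts[subset] = sum(1 for _ in folder.glob("*.jpg"))
    return counts
```

For example, `summarize_lit_cn("LIT-CN")` returns a dictionary keyed by `AIGC`, `AIchallenge`, and `muge`.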
## Benchmarks
|Model|Backbone|I2T (image-to-text)|T2I (text-to-image)|
| ---- | ---- | ---- | ---- |
|R2D2|ViT-B/16|35.7|27.4|
|Chinese-CLIP|ViT-B/16|45.7|35.6|
|SigLIP 2|ViT-B/16|4.6|2.6|
|**FG-CLIP 2(ours)**|ViT-B/16|**82.4**|**81.1**|
|R2D2|ViT-L/14|48.3|33.3|
|Chinese-CLIP|ViT-L/14|48.6|38.9|
|SigLIP 2|ViT-L/16|14.8|10.9|
|**FG-CLIP 2(ours)**|ViT-L/16|**86.3**|**85.9**|
|SigLIP 2|ViT-So/16|16.3|11.2|
|MetaCLIP 2|ViT-H/14|77.2|67.6|
|**FG-CLIP 2(ours)**|ViT-So/16|**87.6**|**86.3**|
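
The table does not state which retrieval metric is reported. As an illustration only, the sketch below computes top-1 retrieval accuracy in both directions from an image-text similarity matrix, assuming one caption per image so that image `i` is paired with text `i`. The function name `top1_retrieval` and the toy scores are made up for the example and do not reproduce the evaluation protocol behind the numbers above.

```python
def top1_retrieval(sim):
    """Top-1 retrieval accuracy (in percent) for both directions.

    sim[i][j] is the similarity between image i and text j.
    Assumption: the ground-truth match for image i is text i.
    Returns (i2t, t2i). Ties resolve to the lowest index.
    """
    n = len(sim)
    # Image-to-text: for each image (row), is its own caption ranked first?
    i2t_hits = sum(
        1 for i in range(n) if max(range(n), key=lambda j: sim[i][j]) == i
    )
    # Text-to-image: for each text (column), is its own image ranked first?
    t2i_hits = sum(
        1 for j in range(n) if max(range(n), key=lambda i: sim[i][j]) == j
    )
    return 100.0 * i2t_hits / n, 100.0 * t2i_hits / n


# Toy 3x3 similarity matrix (invented values).
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.6],
]
i2t, t2i = top1_retrieval(scores)
```

In the toy matrix, image 2 scores highest against text 0, so one of the three image queries misses, while every text query retrieves its own image.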
## Citation
If you find LIT-CN useful for your research and applications, please cite using this BibTeX:
```
@article{xie2025fg2,
title={FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Ao, Ji and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2510.10921},
year={2025}
}
```
```
@article{xie2025fg,
title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2505.05071},
year={2025}
}
```
## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the [Apache License 2.0](./LICENSE).
Provider: maas
Created: 2025-10-16



