FineHARD
# FG-CLIP: Fine-Grained Visual and Textual Alignment
**[FG-CLIP: Fine-Grained Visual and Textual Alignment](https://arxiv.org/abs/2505.05071)**
<br>
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
<br>
[arXiv](https://arxiv.org/abs/2505.05071)
[ICML 2025](https://icml.cc/Conferences/2025)
[GitHub](https://github.com/360CVGroup/FG-CLIP)
<p align="center">
<img src="https://huggingface.co/qihoo360/fg-clip-large/resolve/main/radar_chart_methods.png" width="500" height="440"/>
</p>
## Model Framework
FG-CLIP is trained in two stages. The first stage uses global-level caption-image pairs to achieve initial fine-grained alignment. The second stage supplements these with region-level captions, including detailed region captions and positive/negative region descriptions, to further refine the alignment.
<p align="center">
<img src="https://huggingface.co/qihoo360/fg-clip-large/resolve/main/fgclip_strc.png" width=80%/>
</p>
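For readers unfamiliar with CLIP-style training, the sketch below shows a generic symmetric contrastive loss over a batch of paired image and text embeddings. It only illustrates the kind of alignment objective that CLIP-family models build on; it is not FG-CLIP's actual training objective, and the function names and the temperature value are assumptions made for this example.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _log_softmax(row):
    # Subtract the maximum for numerical stability.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss: the i-th image should match the i-th text."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[_dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row is a classification over texts.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each column is a classification over images.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

Correctly paired embeddings yield a loss near zero, while mismatched pairs yield a large loss. Region-level captions and hard negatives, as used in the second stage, would add further candidates to the comparison; how FG-CLIP does this is described in the paper rather than here.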
## Data Preparation
To run the FG-CLIP training code, follow the steps below.
### Step 1: Download the model
Download either the FG-CLIP model, [🤗ViT-L@336px](https://huggingface.co/qihoo360/fg-clip-large), or the OpenAI CLIP model, [🤗ViT-L@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).
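
As one possible way to script this step, the sketch below fetches either checkpoint with `huggingface_hub.snapshot_download`. The repository IDs come from the two links above; the helper name `download_checkpoint` and the `checkpoints/` target directory are assumptions made for illustration, and the `huggingface_hub` package must be installed separately.

```python
from pathlib import Path

# Repository IDs taken from the two links in Step 1.
MODEL_REPOS = {
    "fg-clip": "qihoo360/fg-clip-large",
    "openai-clip": "openai/clip-vit-large-patch14-336",
}


def download_checkpoint(name, root="checkpoints"):
    """Download one checkpoint into <root>/<name> and return the local path.

    `root` is a hypothetical location; point it wherever your training
    configuration expects the weights.
    """
    if name not in MODEL_REPOS:
        raise KeyError(f"unknown model {name!r}; choose from {sorted(MODEL_REPOS)}")
    # Imported lazily so the mapping above can be used without the dependency.
    from huggingface_hub import snapshot_download

    target = Path(root) / name
    snapshot_download(repo_id=MODEL_REPOS[name], local_dir=str(target))
    return target
```

Calling `download_checkpoint("fg-clip")` would place the FG-CLIP weights under `checkpoints/fg-clip`.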
### Step 2: Prepare the FineHARD (Fine-Grained Visual Grounding + Recaption + Hard Negative) Dataset
First, download the dataset from [🤗FineHARD](https://huggingface.co/datasets/qihoo360/FineHARD). After downloading, unzip all compressed files to obtain the following file structure:
```none
FineHARD
├── url2key_jsons
│   ├── url2key_coyo_image_0.json
│   ├── ...
│   ├── url2key_coyo_image_20.json
├── jsonfiles
│   ├── 2024-12-06_18-32-53_results_10_218_126_44_1025.json
│   ├── 2024-12-06_18-33-17_results_llama70b-shcdt-h100-4gpus-no-2.json
│   ├── ...
├── coyo_image_0
│   ├── 00000.parquet
│   ├── 00001.parquet
│   ├── ...
│   ├── 00099.parquet
├── coyo_image_1
│   ├── 00000.parquet
│   ├── 00001.parquet
│   ├── ...
│   ├── 00099.parquet
├── ...
├── coyo_image_20
│   ├── 00000.parquet
│   ├── 00001.parquet
│   ├── ...
│   ├── 00050.parquet
├── ...
```
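Before moving on, it can help to confirm that the unzipped dataset matches the layout above. The following is a minimal sketch using only the standard library; the function name `check_finehard_layout` is hypothetical, and the default of 21 parent folders is inferred from the `coyo_image_0` to `coyo_image_20` entries in the listing.

```python
from pathlib import Path


def check_finehard_layout(root, num_parent_folders=21):
    """Return a list of expected entries that are missing under `root`."""
    root = Path(root)
    missing = []
    # Top-level directories shown in the listing.
    for sub in ("url2key_jsons", "jsonfiles"):
        if not (root / sub).is_dir():
            missing.append(sub)
    for i in range(num_parent_folders):
        # One URL-to-key mapping file per parent folder.
        mapping = root / "url2key_jsons" / f"url2key_coyo_image_{i}.json"
        if not mapping.is_file():
            missing.append(mapping.relative_to(root).as_posix())
        # Each parent folder should contain at least one parquet shard.
        shard_dir = root / f"coyo_image_{i}"
        if not shard_dir.is_dir() or not any(shard_dir.glob("*.parquet")):
            missing.append(f"coyo_image_{i}/*.parquet")
    return missing
```

An empty return value means every expected entry was found; otherwise the list names what to download or unzip again.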
Next, install the `img2dataset` package:
```bash
pip install img2dataset
```
Set the `file_in` parameter in `data/get_data.sh` to the path where you downloaded the data, and set the directories where the files should be saved (`pre_dir`, `dir_save`). Then run the script:
```bash
bash data/get_data.sh
```
Because of randomness in the download process, the image names assigned to the URLs do not match the image names we use, so a conversion is needed. This step uses the `url2key_jsons/*.json` files included in the FineHARD dataset. You can also use these files to check the download links of all the images we used.
```bash
python -m data.convert_image_name \
    --url2key_json FineHARD/url2key_jsons \
    --down_file_root data/down-grit-12m/ \
    --num_parent_folders 21 \
    --num_subfolders_per_parent 100 \
    --resave_file_root data/grit-12m/

# Remove the original download directory after the conversion.
rm -r data/down-grit-12m/
```
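To make the idea behind the conversion concrete, here is a rough sketch of renaming downloaded images through a URL-to-key mapping. It is not the implementation of `data.convert_image_name`. It rests on two unconfirmed assumptions: that each downloaded image has a JSON sidecar with a `url` field (as `img2dataset` writes for its `files` output format), and that the mapping has the shape `{url: key}`, where `key` is the target name without a file extension. The function name `rename_downloads` is hypothetical.

```python
import json
import shutil
from pathlib import Path

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".webp")


def rename_downloads(url2key, down_root, resave_root):
    """Copy images from down_root to resave_root/<key><suffix>.

    `url2key` is assumed to be a dict mapping an image URL to its target
    name (without extension). Returns the number of images copied.
    """
    down_root, resave_root = Path(down_root), Path(resave_root)
    copied = 0
    for meta_path in sorted(down_root.rglob("*.json")):
        # Find the image stored next to this metadata file, if any.
        image = next(
            (meta_path.with_suffix(s) for s in IMAGE_SUFFIXES
             if meta_path.with_suffix(s).is_file()),
            None,
        )
        if image is None:
            continue  # not an image sidecar (for example, a stats file)
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        key = url2key.get(meta.get("url"))
        if key is None:
            continue  # URL is not in the mapping, so the image is skipped
        dest = resave_root / f"{key}{image.suffix}"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(image, dest)
        copied += 1
    return copied
```

Copying rather than moving keeps the original downloads intact until the result has been checked, which matches the separate `rm -r` step above.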
The final FG-CLIP project directory structure is as follows:
```none
FG-CLIP
├── ...
├── FineHARD
│   ├── jsonfiles
│   │   ├── 2024-12-06_18-32-53_results_10_218_126_44_1025.json
│   │   ├── 2024-12-06_18-33-17_results_llama70b-shcdt-h100-4gpus-no-2.json
│   │   ├── ...
│   ├── ...
├── data
│   ├── grit-12m
│   │   ├── coyo_image_0
│   │   │   ├── 00000
│   │   │   ├── 00001
│   │   │   ├── ...
│   │   │   ├── 00099
│   │   ├── coyo_image_1
│   │   │   ├── 00000
│   │   │   ├── 00001
│   │   │   ├── ...
│   │   │   ├── 00099
│   │   ├── ...
│   │   ├── coyo_image_20
│   │   │   ├── 00000
│   │   │   ├── 00001
│   │   │   ├── ...
│   │   │   ├── 00050
├── ...
```
## Citation
If you find FineHARD useful for your research and applications, please cite using this BibTeX:
```
@article{xie2025fg,
title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2505.05071},
year={2025}
}
```