GRIT
ModelScope community · Updated 2026-05-15 · Indexed 2024-06-08
Download link:
https://modelscope.cn/datasets/swift/GRIT
# GRIT: Large-Scale Training Corpus of Grounded Image-Text Pairs
### Dataset Description
- **Repository:** [Microsoft unilm](https://github.com/microsoft/unilm/tree/master/kosmos-2)
- **Paper:** [Kosmos-2](https://arxiv.org/abs/2306.14824)
### Dataset Summary
We introduce GRIT, a large-scale dataset of Grounded Image-Text pairs built from image-text pairs in [COYO-700M](https://github.com/kakaobrain/coyo-dataset) and LAION-2B. We construct a pipeline that extracts text spans (i.e., noun phrases and referring expressions) from each caption and links them to their corresponding image regions. More details can be found in the [paper](https://arxiv.org/abs/2306.14824).
### Supported Tasks
During construction, we excluded image-caption pairs for which no bounding boxes were retained. This procedure resulted in a high-quality image-caption subset of COYO-700M, which we will validate in the future.
Furthermore, this dataset contains text-span-bounding-box pairs. Thus, it can be used in many location-aware mono/multimodal tasks, such as phrase grounding, referring expression comprehension, referring expression generation, and open-world object detection.
### Data Instance
A single instance looks like this:
```python
{
'key': '000373938',
'clip_similarity_vitb32': 0.353271484375,
'clip_similarity_vitl14': 0.2958984375,
'id': 1795296605919,
'url': "https://www.thestrapsaver.com/wp-content/uploads/customerservice-1.jpg",
'caption': 'a wire hanger with a paper cover that reads we heart our customers',
'width': 1024,
'height': 693,
'noun_chunks': [[19, 32, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 13, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]],
'ref_exps': [[19, 66, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 66, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]]
}
```
- `key`: The file name generated when using img2dataset to download COYO-700M (can be ignored).
- `clip_similarity_vitb32`: The cosine similarity between the text and image (ViT-B/32) embeddings computed with [OpenAI CLIP](https://github.com/openai/CLIP), provided by COYO-700M.
- `clip_similarity_vitl14`: The cosine similarity between the text and image (ViT-L/14) embeddings computed with [OpenAI CLIP](https://github.com/openai/CLIP), provided by COYO-700M.
- `id`: Unique 64-bit integer ID in COYO-700M.
- `url`: The image URL.
- `caption`: The corresponding caption.
- `width`: The width of the image.
- `height`: The height of the image.
- `noun_chunks`: The noun chunks (extracted by [spaCy](https://spacy.io/)) that have associated bounding boxes (predicted by [GLIP](https://github.com/microsoft/GLIP)). Each inner list holds, in order: 'start of the noun chunk in the caption', 'end of the noun chunk in the caption', 'normalized x_min', 'normalized y_min', 'normalized x_max', 'normalized y_max', 'confidence score'.
- `ref_exps`: The corresponding referring expressions. If a noun chunk has no expansion, we just copy it.
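As an illustration of the field layout described above, here is a minimal sketch that decodes `noun_chunks` and `ref_exps` for the sample instance. The helper name `decode_spans` is hypothetical. The sketch assumes that the start and end values are character offsets into `caption` with an exclusive end, which matches the sample instance but is not stated explicitly in the card. It also assumes that the normalized coordinates are scaled by the `width` and `height` fields.

```python
def decode_spans(caption, spans, width, height):
    """Turn [start, end, x_min, y_min, x_max, y_max, score] entries into records.

    Assumption: start/end are character offsets into the caption (end exclusive),
    and the box coordinates are normalized to [0, 1] by image width and height.
    """
    records = []
    for start, end, x_min, y_min, x_max, y_max, score in spans:
        records.append({
            # Offsets may be stored as floats, so cast before slicing.
            "text": caption[int(start):int(end)],
            # Scale normalized coordinates back to pixels.
            "box_pixels": (x_min * width, y_min * height,
                           x_max * width, y_max * height),
            "score": score,
        })
    return records


# The sample instance from this card, reduced to the fields used here.
instance = {
    "caption": "a wire hanger with a paper cover that reads we heart our customers",
    "width": 1024,
    "height": 693,
    "noun_chunks": [
        [19, 32, 0.019644069503434333, 0.31054004033406574,
         0.9622142865754519, 0.9603442351023356, 0.79298526],
        [0, 13, 0.019422357885505368, 0.027634161214033764,
         0.9593302408854166, 0.969467560450236, 0.67520964],
    ],
    "ref_exps": [
        [19, 66, 0.019644069503434333, 0.31054004033406574,
         0.9622142865754519, 0.9603442351023356, 0.79298526],
        [0, 66, 0.019422357885505368, 0.027634161214033764,
         0.9593302408854166, 0.969467560450236, 0.67520964],
    ],
}

size = (instance["width"], instance["height"])
noun_chunks = decode_spans(instance["caption"], instance["noun_chunks"], *size)
ref_exps = decode_spans(instance["caption"], instance["ref_exps"], *size)

for chunk, expression in zip(noun_chunks, ref_exps):
    print(repr(chunk["text"]), "->", repr(expression["text"]))
```

Under these assumptions, the first noun chunk decodes to "a paper cover", and its referring expression extends the same span to the end of the caption while keeping the same bounding box.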
### Download image
We recommend using the [img2dataset](https://github.com/rom1504/img2dataset) tool to download the images.
1. Download the metadata. You can get it by cloning this repository:
```bash
git lfs install
git clone https://huggingface.co/datasets/zzliang/GRIT
```
2. Install [img2dataset](https://github.com/rom1504/img2dataset).
```bash
pip install img2dataset
```
3. Download the images.
Replace `/path/to/GRIT_dataset/grit-20m` with the local path to this repository.
```bash
img2dataset --url_list /path/to/GRIT_dataset/grit-20m --input_format "parquet" \
--url_col "url" --caption_col "caption" --output_format webdataset \
--output_folder /tmp/grit --processes_count 4 --thread_count 64 --image_size 256 \
--resize_only_if_bigger=True --resize_mode="keep_ratio" --skip_reencode=True \
--save_additional_columns '["id","noun_chunks","ref_exps","clip_similarity_vitb32","clip_similarity_vitl14"]' \
--enable_wandb False
```
You can adjust some parameters according to your actual needs (e.g., `processes_count`, `thread_count`, `image_size`, `save_additional_columns`).
More img2dataset options are documented [here](https://github.com/rom1504/img2dataset#api).
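The same download can also be started from Python. The sketch below is not part of the original instructions: it assumes that the `download` function exposed by img2dataset accepts keyword arguments with the same names as the CLI flags above, and the helpers `build_download_kwargs` and `run_download` are hypothetical names introduced for illustration.

```python
def build_download_kwargs(metadata_dir, output_dir, image_size=256):
    """Collect the options from the CLI call above as keyword arguments.

    Assumption: img2dataset's Python entry point uses the same option names
    as the command-line flags.
    """
    return {
        "url_list": metadata_dir,
        "input_format": "parquet",
        "url_col": "url",
        "caption_col": "caption",
        "output_format": "webdataset",
        "output_folder": output_dir,
        "processes_count": 4,
        "thread_count": 64,
        "image_size": image_size,
        "resize_only_if_bigger": True,
        "resize_mode": "keep_ratio",
        "skip_reencode": True,
        # Extra metadata columns to carry over next to each image.
        "save_additional_columns": [
            "id", "noun_chunks", "ref_exps",
            "clip_similarity_vitb32", "clip_similarity_vitl14",
        ],
        "enable_wandb": False,
    }


def run_download(metadata_dir, output_dir):
    """Start the download; requires img2dataset to be installed."""
    # Imported lazily so the helper above works without img2dataset.
    from img2dataset import download

    download(**build_download_kwargs(metadata_dir, output_dir))


# Example call (starts a large download, so it is left commented out):
# run_download("/path/to/GRIT_dataset/grit-20m", "/tmp/grit")
```

Keeping the options in one dictionary makes it easy to change `processes_count`, `thread_count`, or `image_size` in a single place before calling `run_download`.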
### Citation Information
If you use this dataset in a project or in research, please cite our paper and COYO-700M:
```
@article{Kosmos2,
title={Kosmos-2: Grounding Multimodal Large Language Models to the World},
author={Zhiliang Peng and Wenhui Wang and Li Dong and Yaru Hao and Shaohan Huang and Shuming Ma and Furu Wei},
journal={ArXiv},
year={2023},
volume={abs/2306.14824}
}
@misc{kakaobrain2022coyo-700m,
title = {COYO-700M: Image-Text Pair Dataset},
author = {Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim},
year = {2022},
howpublished = {\url{https://github.com/kakaobrain/coyo-dataset}},
}
```
Provider: maas
Created: 2024-06-05



