GRIT

Name: GRIT
Creator: maas
Published: 2026-05-15 13:43:02
License: No description available

ModelScope community: updated 2026-05-15, indexed 2024-06-08

Download link:

https://modelscope.cn/datasets/swift/GRIT


Resource description:

# GRIT: Large-Scale Training Corpus of Grounded Image-Text Pairs

### Dataset Description

- **Repository:** [Microsoft unilm](https://github.com/microsoft/unilm/tree/master/kosmos-2)
- **Paper:** [Kosmos-2](https://arxiv.org/abs/2306.14824)

### Dataset Summary

We introduce GRIT, a large-scale dataset of Grounded Image-Text pairs built from image-text pairs in [COYO-700M](https://github.com/kakaobrain/coyo-dataset) and LAION-2B. We construct a pipeline to extract text spans (i.e., noun phrases and referring expressions) from the caption and link them to their corresponding image regions. More details can be found in the [paper](https://arxiv.org/abs/2306.14824).

### Supported Tasks

During construction, we excluded image-caption pairs for which no bounding boxes were retained. This procedure resulted in a high-quality image-caption subset of COYO-700M, which we will validate in the future.

Furthermore, this dataset contains text-span-bounding-box pairs. It can therefore be used in many location-aware mono/multimodal tasks, such as phrase grounding, referring expression comprehension, referring expression generation, and open-world object detection.

### Data Instance

One instance is:

```python
{
    'key': '000373938',
    'clip_similarity_vitb32': 0.353271484375,
    'clip_similarity_vitl14': 0.2958984375,
    'id': 1795296605919,
    'url': "https://www.thestrapsaver.com/wp-content/uploads/customerservice-1.jpg",
    'caption': 'a wire hanger with a paper cover that reads we heart our customers',
    'width': 1024,
    'height': 693,
    'noun_chunks': [
        [19, 32, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526],
        [0, 13, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]
    ],
    'ref_exps': [
        [19, 66, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526],
        [0, 66, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]
    ]
}
```

- `key`: The file name generated when using img2dataset to download COYO-700M (can be ignored).
- `clip_similarity_vitb32`: The cosine similarity between the text and image (ViT-B/32) embeddings from [OpenAI CLIP](https://github.com/openai/CLIP), provided by COYO-700M.
- `clip_similarity_vitl14`: The cosine similarity between the text and image (ViT-L/14) embeddings from [OpenAI CLIP](https://github.com/openai/CLIP), provided by COYO-700M.
- `id`: Unique 64-bit integer ID in COYO-700M.
- `url`: The image URL.
- `caption`: The corresponding caption.
- `width`: The width of the image.
- `height`: The height of the image.
- `noun_chunks`: The noun chunks (extracted by [spaCy](https://spacy.io/)) that have associated bounding boxes (predicted by [GLIP](https://github.com/microsoft/GLIP)). Each inner list contains, in order: start of the noun chunk in the caption, end of the noun chunk in the caption, normalized x_min, normalized y_min, normalized x_max, normalized y_max, and confidence score.
- `ref_exps`: The corresponding referring expressions, in the same format as `noun_chunks`. If a noun chunk has no expansion, it is copied as is.

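
To make the field layout concrete, the following minimal sketch decodes one `noun_chunks` or `ref_exps` entry into its caption text and a pixel-space box. The helper name `decode_grounding` is hypothetical and not part of any GRIT tooling. It assumes the start and end offsets are character indices with an exclusive end, which is consistent with the example instance above but is an observation from that one instance rather than a documented guarantee.

```python
def decode_grounding(entry, caption, width, height):
    """Decode one [start, end, x_min, y_min, x_max, y_max, score] entry.

    Assumption: start/end are character offsets into the caption with an
    exclusive end, as the example instance suggests.
    """
    start, end, x_min, y_min, x_max, y_max, score = entry
    return {
        "text": caption[int(start):int(end)],
        # Normalized coordinates are scaled by the image size to get pixels.
        "box_px": (x_min * width, y_min * height, x_max * width, y_max * height),
        "score": score,
    }


# Values copied from the example instance (other fields omitted).
instance = {
    "caption": "a wire hanger with a paper cover that reads we heart our customers",
    "width": 1024,
    "height": 693,
    "noun_chunks": [
        [19, 32, 0.019644069503434333, 0.31054004033406574,
         0.9622142865754519, 0.9603442351023356, 0.79298526],
        [0, 13, 0.019422357885505368, 0.027634161214033764,
         0.9593302408854166, 0.969467560450236, 0.67520964],
    ],
    "ref_exps": [
        [19, 66, 0.019644069503434333, 0.31054004033406574,
         0.9622142865754519, 0.9603442351023356, 0.79298526],
        [0, 66, 0.019422357885505368, 0.027634161214033764,
         0.9593302408854166, 0.969467560450236, 0.67520964],
    ],
}

for entry in instance["noun_chunks"]:
    decoded = decode_grounding(
        entry, instance["caption"], instance["width"], instance["height"]
    )
    print(decoded["text"], decoded["box_px"], decoded["score"])
```

For this instance the two noun chunks decode to `a paper cover` and `a wire hanger`. Applying the same helper to `ref_exps` yields the longer spans that extend to the end of the caption.
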
### Download image

We recommend using the [img2dataset](https://github.com/rom1504/img2dataset) tool to download the images.

1. Download the metadata. You can download it by cloning the current repository:

```bash
git lfs install
git clone https://huggingface.co/datasets/zzliang/GRIT
```

2. Install [img2dataset](https://github.com/rom1504/img2dataset).

```bash
pip install img2dataset
```

3. Download the images. You need to replace `/path/to/GRIT_dataset/grit-20m` with the local path to this repository.

```bash
img2dataset --url_list /path/to/GRIT_dataset/grit-20m --input_format "parquet" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder /tmp/grit --processes_count 4 --thread_count 64 --image_size 256 \
    --resize_only_if_bigger=True --resize_mode="keep_ratio" --skip_reencode=True \
    --save_additional_columns '["id","noun_chunks","ref_exps","clip_similarity_vitb32","clip_similarity_vitl14"]' \
    --enable_wandb False
```

You can adjust some parameters according to your needs (e.g., `processes_count`, `thread_count`, `image_size`, `save_additional_columns`). More img2dataset parameters are documented [here](https://github.com/rom1504/img2dataset#api).

### Citation Information

If you use this dataset in any project or research, please cite our paper and COYO-700M:

```
@article{Kosmos2,
  title={Kosmos-2: Grounding Multimodal Large Language Models to the World},
  author={Zhiliang Peng and Wenhui Wang and Li Dong and Yaru Hao and Shaohan Huang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.14824}
}

@misc{kakaobrain2022coyo-700m,
  title = {COYO-700M: Image-Text Pair Dataset},
  author = {Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim},
  year = {2022},
  howpublished = {\url{https://github.com/kakaobrain/coyo-dataset}},
}
```
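
Once the img2dataset command from the Download image section has finished, the grounding annotations can be read back from the output shards. The sketch below uses only the Python standard library and assumes the shards follow img2dataset's usual webdataset layout, in which each sample is stored in a tar file as several members sharing a key, including a `.json` member carrying the metadata and any columns kept via `save_additional_columns`. The function name `iter_shard_metadata` and the shard path in the usage note are hypothetical.

```python
import io
import json
import tarfile


def iter_shard_metadata(shard):
    """Yield the parsed JSON metadata of every sample in one tar shard.

    `shard` is either a filesystem path (str) or a seekable binary file object.
    Assumption: each sample has a `<key>.json` member, as in img2dataset's
    webdataset output.
    """
    kwargs = {"name": shard} if isinstance(shard, str) else {"fileobj": shard}
    with tarfile.open(mode="r", **kwargs) as tar:
        for member in tar:
            # Skip image and caption members; only metadata is parsed here.
            if member.isfile() and member.name.endswith(".json"):
                yield json.load(tar.extractfile(member))


# Usage sketch (hypothetical shard name under the output folder used above):
# for meta in iter_shard_metadata("/tmp/grit/00000.tar"):
#     print(meta.get("id"), meta.get("noun_chunks"))
```

Reading the `.json` members directly avoids decoding any images, which can be convenient when only the bounding-box annotations are needed.
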


Provider: maas

Created: 2024-06-05
