deepcs233/Visual-CoT
Source: Hugging Face. Updated: 2024-06-13. Indexed: 2024-04-19.
Download link:
https://hf-mirror.com/datasets/deepcs233/Visual-CoT
Resource description:
---
license: apache-2.0
---
# VisCoT Dataset Card


There is a shortage of multimodal datasets for training multi-modal large language models (MLLMs) that need to identify specific regions in an image for additional attention to improve response performance. Datasets of this type, with grounding bbox annotations, could help an MLLM output intermediate, interpretable attention areas and enhance performance.
To fill the gap, we curate a visual CoT dataset. **This dataset specifically focuses on identifying critical regions within images, a feature essential for models to concentrate on relevant visual elements to improve response accuracy. Each data sample consists of a question, answer, and a corresponding visual bounding box across five domains. Some data samples also include extra detailed reasoning steps.**
To ensure a robust foundation for detailed visual and textual analysis, our dataset deliberately integrates a diverse selection of data including **text/doc, fine-grained understanding, charts, general VQA, and relation reasoning**. These data domains are deliberately chosen to cultivate a comprehensive skill set across varied analytical tasks: 1) Text/doc enhances an MLLM's capabilities in OCR and contextual understanding, which are crucial for applications requiring text interpretation in complex environments. 2) Fine-grained understanding aids in identifying and distinguishing subtle differences in visual appearance and patterns. 3) Charts foster the ability to interpret graphical data, which is essential for business and scientific applications. 4) General VQA exposes models to a wide array of visual queries, improving their general usability. 5) Relation reasoning data develops the spatial and contextual awareness of MLLMs, which is vital for interactive and navigational tasks. Together, these domains ensure the dataset not only fills existing gaps but also enhances the versatility and contextual awareness of MLLMs across varied scenarios.
## Dataset details
- `viscot_363k.json`: the data list which only contains VisCoT-related training data
- `viscot_mixed_2m.json`: the mixed data list for reproducing VisCoT
- `metadata/`: metainfo folder, including more raw and detailed information and annotations
  - `cub_cot_train.jsonl`: metainfo for the CUB dataset
  - `docvqa_cot_train.jsonl`: metainfo for the DocVQA dataset
  - ...
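
The files listed above can be fetched and parsed with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes the `.json` data lists hold a top-level JSON array, that the `.jsonl` files hold one JSON object per line, and that the metadata files sit under `metadata/` in the repository. The helper names (`fetch_file`, `load_json_list`, `iter_jsonl`) are hypothetical; only `hf_hub_download` comes from the `huggingface_hub` library.

```python
import json
from pathlib import Path
from typing import Any, Iterator

REPO_ID = "deepcs233/Visual-CoT"  # dataset repository named on this card


def fetch_file(filename: str) -> str:
    """Download one file from the dataset repo and return its local path.

    Requires `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import hf_hub_download  # imported lazily on purpose

    return hf_hub_download(repo_id=REPO_ID, filename=filename, repo_type="dataset")


def load_json_list(path: str) -> list[Any]:
    """Read a `.json` data list (assumed to be a top-level JSON array)."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    return data


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed object per non-empty line of a `.jsonl` file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

A typical call would be `records = load_json_list(fetch_file("viscot_363k.json"))`. The metadata path (for example `metadata/cub_cot_train.jsonl`) is inferred from the listing above and may need adjusting to the actual repository layout.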
**Dataset date:**
VisCoT-1.0 Dataset was collected in June 2024.
**Paper or resources for more information:**
Github: https://github.com/deepcs233/Visual-CoT
Paper: https://arxiv.org/abs/2403.16999
**License:**
Attribution-NonCommercial 4.0 International
**Where to send questions or comments about the model:**
https://github.com/deepcs233/Visual-CoT/issues
## Disclaimer
This dataset was collected and released solely for research purposes, with the goal of making the MLLMs dynamically focus on visual inputs and provide intermediate interpretable thoughts. The authors are strongly against any potential harmful use of the data or technology to any party.
### Intended Use
The data, code, and model checkpoints are intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper. The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision making purposes.
### Primary Intended Use
The primary intended use is to support AI researchers reproducing and building on top of this work. VisCoT and its associated models should be helpful for exploring various visual question answering (VQA) research questions.
### Out-of-Scope Use
Any deployed use case of the model, commercial or otherwise, is out of scope. Although we evaluated the models using a broad set of publicly available research benchmarks, the models and evaluations are intended for research use only and not for deployed use cases.
The VisCoT Dataset is curated to address the shortage of multimodal datasets for training multi-modal large language models (MLLMs) that need to identify specific regions in images to improve response performance. It focuses on identifying critical regions within images and includes data samples with questions, answers, and corresponding visual bounding boxes across five domains: text/doc, fine-grained understanding, charts, general VQA, and relation reasoning. The dataset deliberately integrates a diverse selection of data to cultivate a comprehensive skill set across varied analytical tasks. It includes specific files and metadata for training and reproducing VisCoT, and was collected in June 2024 under an Attribution-NonCommercial 4.0 International license. The intended use of the dataset is to support future research in visual-language processing and reproducibility of experimental results, with a strong disclaimer against any potential harmful use.
Provider:
deepcs233
Original information summary
Dataset overview
License information
- License type: Apache-2.0
Collected summary
Dataset introduction

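Construction
In visual-language processing, the development of multimodal large language models (MLLMs) needs annotated data that guides models to attend to key image regions. The VisCoT dataset was built to fill this gap by curating data with visual bounding-box annotations. It integrates samples from five domains; each sample contains a question, an answer, and a corresponding visual bounding box, and some samples also provide detailed reasoning steps. Data collection emphasizes diversity, covering text/doc, fine-grained understanding, charts, general visual question answering (VQA), and relation reasoning, so that models can build capability across a wide range of analytical tasks.
Features
The core feature of the VisCoT dataset is its focus on identifying key image regions. It provides visual bounding boxes that guide a model to focus dynamically on the relevant visual elements, and some samples add detailed reasoning steps that make model outputs more interpretable. The dataset covers five domains: text understanding, fine-grained visual analysis, chart interpretation, general visual question answering, and relation reasoning. This multi-domain design is intended to improve a model's contextual awareness and generalization so that it can adapt to complex and varied application scenarios.
Usage
The VisCoT dataset is mainly used to support research and development of multimodal large language models, in particular to improve attention and reasoning in visual question answering (VQA) tasks. Researchers can access the data by loading the provided JSON files (such as viscot_363k.json or viscot_mixed_2m.json), which contain questions, answers, bounding boxes, and optional reasoning steps. The dataset is suitable for model training, evaluation, and reproducing related experiments, and aims to promote academic exploration in visual-language processing. Note that it is for research use only and is not intended for clinical decision making or commercial deployment.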
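
Because this page does not document the exact field names inside the JSON files, a quick first step after loading is to count which keys actually occur. The sketch below is an illustrative helper (the name `summarize_fields` is hypothetical); it makes no assumption about the schema beyond records being JSON objects.

```python
from collections import Counter
from typing import Any, Iterable


def summarize_fields(records: Iterable[Any], limit: int = 1000) -> Counter:
    """Count top-level keys over the first `limit` records.

    Records that are not dicts are skipped, so the function is safe to run
    on a file whose layout is not yet known.
    """
    counts: Counter = Counter()
    for i, record in enumerate(records):
        if i >= limit:
            break
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

Calling `summarize_fields(records).most_common()` on a loaded data list shows which keys carry the question, answer, and bounding-box information, and how consistently they appear.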
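Background and challenges
Background overview
The Visual-CoT dataset, released by the deepcs233 team in June 2024, is presented as an important step in research on multimodal large language models. It targets a bottleneck in existing visual question answering tasks, namely that models struggle to locate key image regions precisely, by introducing visual chain-of-thought data with bounding-box annotations that give models an interpretable intermediate attention step. Its core research question is how to improve model reasoning in complex visual scenes. It covers five domains (text/doc, fine-grained understanding, chart analysis, general visual question answering, and relation reasoning) in order to strengthen generalization and contextual awareness across diverse tasks and to support practical use of multimodal AI.
Current challenges
The domain challenge that Visual-CoT addresses is that multimodal large language models have difficulty focusing dynamically on key image regions in visual question answering, which limits answer accuracy. The construction challenges lie in the complexity of annotation: bounding boxes and reasoning steps must be annotated precisely across five heterogeneous domains while balancing data diversity and quality. Data integration must also overcome format differences and semantic-consistency problems across sources in order to support robust learning on tasks such as OCR and fine-grained recognition.
Common scenarios
Classic usage scenario
In visual-language processing, the VisCoT dataset provides multimodal large language models with training data for region localization. Its classic usage scenario is to use bounding-box annotations to guide a model to focus on specific image regions while answering a visual question, mimicking human visual reasoning. With this design, a model handling a complex visual query first identifies the relevant elements in the image and then generates an answer based on them, which is intended to improve performance on fine-grained understanding and relation reasoning tasks.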
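
The "locate the region first, then answer" flow can be illustrated by cropping an image to an annotated box. This is a sketch under stated assumptions: the page does not specify the bounding-box format, so the code assumes `[x_min, y_min, x_max, y_max]` in pixel units, and it uses a toy nested-list image so that it runs without extra dependencies. The names `clamp_bbox` and `crop_region` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def clamp_bbox(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Clip an assumed [x_min, y_min, x_max, y_max] pixel box to the image bounds.

    Corners given in the wrong order are swapped, and fractional coordinates
    are expanded outward to whole pixels.
    """
    xa, ya, xb, yb = bbox
    x0, x1 = math.floor(min(xa, xb)), math.ceil(max(xa, xb))
    y0, y1 = math.floor(min(ya, yb)), math.ceil(max(ya, yb))
    x0, x1 = min(max(0, x0), width), min(max(0, x1), width)
    y0, y1 = min(max(0, y0), height), min(max(0, y1), height)
    return x0, y0, x1, y1


def crop_region(image: Image, bbox: Sequence[float]) -> Image:
    """Return the sub-image inside the box (empty if the box misses the image)."""
    height = len(image)
    width = len(image[0]) if image else 0
    x0, y0, x1, y1 = clamp_bbox(bbox, width, height)
    return [row[x0:x1] for row in image[y0:y1]]
```

With real images, the clamped tuple has the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the cropped region could be passed to a model alongside the full image. Whether the stored boxes are absolute or normalized should be checked against the data first.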
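Practical applications
In practice, the VisCoT dataset can be applied to intelligent systems that combine visual and textual analysis. For example, in document understanding, a model can locate and interpret text in an image; in medical or industrial inspection, fine-grained understanding helps identify subtle differences in visual patterns; and in business intelligence, a model can extract key insights from graphical data through chart interpretation. These applications are expected to improve the interaction efficiency and decision accuracy of automated systems.
Derived related work
Researchers have built a body of follow-up work on the VisCoT dataset, mainly aimed at improving the interpretability and performance of multimodal models. For example, some studies use its bounding-box annotations to develop visual-attention-guided reasoning frameworks that let a model focus step by step on key image regions; others combine the dataset's multi-domain coverage to build more general vision-language pretraining models. These derived works support the practical value of the dataset and further advance visual reasoning in both academic and engineering settings.
The content above was collected and summarized by 遇见数据集.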



