cota-mantis
收藏魔搭社区2025-12-05 更新2025-09-13 收录
下载链接:
https://modelscope.cn/datasets/Salesforce/cota-mantis
下载链接
链接失效反馈官方服务:
资源简介:
# 🌮 TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
<h3 align="left"> <a href="https://taco-project.github.io/">🌐 Website</a> | <a href="https://arxiv.org/pdf/2412.05479">📑 Arxiv</a> | <a href="https://github.com/SalesforceAIResearch/CoTA">💻 Code</a>| <a href="https://huggingface.co/collections/Salesforce/cota-datasets-675333e57dd34a4adc5f3ff4">🤗 Datasets</a>
<h5 align="left"> If you like our project or are interested in its updates, please star us :) Thank you! ⭐ </h2>
## Summary
TLDR: CoTA is a large-scale dataset of synthetic Chains-of-Thought-and-Action (CoTA) generated by multi-modal large language models.
## Load data
```
from datasets import load_dataset
dataset = load_dataset("Salesforce/cota-mantis", split="cota_293k")
```
## Dataset Card
### Dataset Details
This dataset contains synthetic chains of thoughts and actions involving 15 actions:```OCR```, ```LocalizeObjects```, ```GetObjects```,
```EstimateRegionDepth```, ```EstimateObjectDepth```, ```Crop```, ```ZoomIn```, ```QueryLanguageModel```, ```GetImageToImagesSimilarity```, ```GetImageToTextsSimilarity```,
```GetTextToImagesSimilarity```, ```DetectFaces```, ```QueryKnowledgeBase```, ```Calculate```, and ```SolveMathEquation```. Additionally, the ```Terminate``` action
is added for the model to provide a final answer. You can find the detailed statistics of this dataset,
including the data sources distribution, the average and max number of images and turns below:
<img src="dataset_stats.png" alt="dataset stats" width="800"/>
<!-- ### Dataset Sources
- **Cauldron:**
- **Mantis-Instruct:**
-->
### Uses
<!-- Address questions around how the dataset is intended to be used. -->
The intended use of this dataset is to finetune multi-modal language models to produce chains of thoughts and actions to answer difficult and complex visual questions.
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
You can directly use this dataset to train Mantis-based models with our [codebase](https://github.com/SalesforceAIResearch/TACO). To train LLaVA-OneVision models, please use ```cota-llava``` in the [collection](https://huggingface.co/collections/Salesforce/cota-datasets-675333e57dd34a4adc5f3ff4).
To train other multi-modal language models, you might need to adapt the conversation format to work for your particular models.
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
This dataset should not be used for testing models.
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
The source data comes from [Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Mantis-Instruct](https://huggingface.co/datasets/TIGER-Lab/Mantis-Instruct).
They are collected from various existing datasets, including COCO, AOKVQA, ScienceQA, Visual Genome, etc.
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
<img src="data_gen.png" width=1000>
<!--  -->
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Our dataset has the following limitations:
- The chains of thoughts and actions are generated by gpt-4o-2024-08-06 and thus inherit its biases;
- The actions are somewhat limited as they cover mostly vision-centric tools such as DepthEstimation and some generic tools such as QueryKnowledgeBase.
- Please refer to the paper for additional limitations.
## License
The CoTA datasets are licensed under the noncommerical license [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This release is for research purposes only in support of an academic paper.
## Citation
```
@misc{ma2024tacolearningmultimodalaction,
title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action},
author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
year={2024},
eprint={2412.05479},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.05479},
}
```
# 🌮 TACO: 基于合成思维-行动链的多模态动作模型学习
<h3 align="left"> <a href="https://taco-project.github.io/">🌐 项目主页</a> | <a href="https://arxiv.org/pdf/2412.05479">📑 Arxiv论文</a> | <a href="https://github.com/SalesforceAIResearch/CoTA">💻 代码仓库</a>| <a href="https://huggingface.co/collections/Salesforce/cota-datasets-675333e57dd34a4adc5f3ff4">🤗 数据集集合</a>
<h5 align="left"> 如果您喜欢我们的项目或关注其更新,请为我们点亮星标 :) 感谢您的支持! ⭐ </h5>
## 摘要
TLDR: CoTA是一个由多模态大语言模型生成的大规模合成思维-行动链(Chains-of-Thought-and-Action,以下简称CoTA)数据集。
## 数据加载
from datasets import load_dataset
dataset = load_dataset("Salesforce/cota-mantis", split="cota_293k")
## 数据集卡片
### 数据集详情
本数据集包含涉及15种动作的合成思维与行动链,分别为:光学字符识别(Optical Character Recognition,OCR)、目标定位(LocalizeObjects)、获取目标(GetObjects)、区域深度估计(EstimateRegionDepth)、目标深度估计(EstimateObjectDepth)、裁剪(Crop)、放大(ZoomIn)、查询大语言模型(QueryLanguageModel)、图像-图像相似度计算(GetImageToImagesSimilarity)、图像-文本相似度计算(GetImageToTextsSimilarity)、文本-图像相似度计算(GetTextToImagesSimilarity)、人脸检测(DetectFaces)、知识库查询(QueryKnowledgeBase)、数值计算(Calculate)以及数学方程求解(SolveMathEquation)。此外,为便于模型输出最终答案,还新增了终止(Terminate)动作。您可在下方查阅本数据集的详细统计信息,包括数据源分布、平均与最大图像数量及对话轮次等:
<img src="dataset_stats.png" alt="数据集统计信息" width="800"/>
<!-- ### 数据集来源
- **Cauldron:**
- **Mantis-Instruct:**
-->
### 数据集用途
本数据集的预设用途为:微调多模态语言模型,使其能够生成思维-行动链以解答复杂且高难度的视觉类问题。
### 直接使用场景
您可直接使用本数据集结合我们的[代码仓库](https://github.com/SalesforceAIResearch/TACO)训练基于Mantis的模型。若需训练LLaVA-OneVision模型,请使用数据集集合中的`cota-llava`分支。若需训练其他多模态语言模型,您可能需要根据特定模型调整对话格式以适配。
### 禁止使用场景
本数据集不得用于模型测试。
### 源数据
本数据集的源数据来自[Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)与[Mantis-Instruct](https://huggingface.co/datasets/TIGER-Lab/Mantis-Instruct),二者均采集自多个现有数据集,包括COCO、AOKVQA、ScienceQA、Visual Genome等。
#### 数据收集与处理
<img src="data_gen.png" width=1000>
## 偏差、风险与局限性
本数据集存在以下局限性:
- 思维-行动链由gpt-4o-2024-08-06生成,因此会继承其固有偏差;
- 所覆盖的动作类型较为有限,主要以以视觉为中心的工具(如深度估计)以及部分通用工具(如知识库查询)为主。
- 更多局限性请参阅论文原文。
## 许可证
CoTA数据集采用非商业性许可证[CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)进行授权。用户需自行评估其使用过程中需承担的相关义务与责任,该授权需符合原始数据集及相关数据的对应许可证或条款要求。本数据集仅用于支持学术论文的研究用途。
## 引用
@misc{ma2024tacolearningmultimodalaction,
title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action},
author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
year={2024},
eprint={2412.05479},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.05479},
}
提供机构:
maas
创建时间:
2025-08-17



