LLaVA-CoT-100k
Source: ModelScope community. Updated 2026-05-19; indexed 2024-12-07.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/LLaVA-CoT-100k
# Dataset Card for LLaVA-CoT
The **LLaVA-CoT-100k** dataset is introduced in the paper [LLaVA-CoT: Let Vision Language Models Reason Step-by-Step](https://huggingface.co/papers/2411.10440). This dataset is designed to enable Vision-Language Models (VLMs) to perform autonomous multistage reasoning, integrating samples from various visual question-answering sources with structured reasoning annotations. It aims to address the challenges VLMs face in systematic and structured reasoning for complex visual question-answering tasks.
## Dataset Sources
- **Repository:** [https://github.com/PKU-YuanGroup/LLaVA-CoT](https://github.com/PKU-YuanGroup/LLaVA-CoT)
- **Paper:** [https://arxiv.org/abs/2411.10440](https://arxiv.org/abs/2411.10440)
## Sample Usage
Load the dataset with the Hugging Face `datasets` library, then follow the steps below to set up the images and use the data.
**1. Load with Hugging Face `datasets` Library:**
```python
from datasets import load_dataset
# Load the LLaVA-CoT-100k dataset
dataset = load_dataset("Xkev/LLaVA-CoT-100k")
# Access the training split
train_split = dataset["train"]
# Print an example
print(train_split[0])
```
**2. Prepare Images Locally:**
The repository stores the images as split archive files named `image.zip.part-{aa-ap}`. Merge them manually to rebuild the full image archive, then extract it:
```bash
cat image.zip.part-* > image.zip
unzip image.zip
```
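Where `cat` and `unzip` are unavailable (for example on Windows), the same merge can be sketched in Python. This is a minimal sketch, not part of the project's tooling: the function names `merge_parts` and `extract_archive` are hypothetical, and it assumes the parts sort lexicographically in the right order (`aa`, `ab`, ..., `ap`), as the shell glob above also does.

```python
import shutil
import zipfile
from pathlib import Path


def merge_parts(part_dir, pattern="image.zip.part-*", out_name="image.zip"):
    """Concatenate split archive parts in lexicographic order (aa, ab, ..., ap)."""
    part_dir = Path(part_dir)
    parts = sorted(part_dir.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {part_dir}")
    out_path = part_dir / out_name
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so the whole archive never sits in memory.
                shutil.copyfileobj(src, out)
    return out_path


def extract_archive(zip_path, dest_dir):
    """Check the merged archive for corruption, then extract it."""
    with zipfile.ZipFile(zip_path) as zf:
        bad = zf.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad}")
        zf.extractall(dest_dir)
```

Calling `extract_archive(merge_parts("."), ".")` from the directory that holds the parts mirrors the two shell commands, with an added integrity check before extraction.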
**3. Inference:**
You can use the same code as Llama-3.2-11B-Vision-Instruct to load the model and perform inference. For detailed instructions on test-time stage-wise retracing search (SWIRES), refer to the `inference/README.md` file in the [GitHub repository](https://github.com/PKU-YuanGroup/LLaVA-CoT/blob/main/inference/README.md).
**4. Finetuning:**
To reproduce the paper's results, use the provided finetuning script with `llama-recipes`. Before running it, set `data_path` and `image_base_path` in `train/cot_dataset.py` to the local paths of your training data and images.
```bash
cd train
pip install llama-recipes
torchrun --nnodes 1 --nproc_per_node 8 --master_port 29500 finetuning.py \
--enable_fsdp --lr 1e-5 --num_epochs 3 --batch_size_training 4 \
--model_name meta-llama/Llama-3.2-11B-Vision-Instruct \
--dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder LLaVA-CoT \
--use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" \
--custom_dataset.file "datasets/cot_dataset.py" --run_validation False \
--batching_strategy padding
```
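Before launching a multi-GPU run, it can help to confirm that every image referenced by the training file actually exists on disk. The following is a minimal sketch under stated assumptions: `find_missing_images` is a hypothetical helper (not part of the repository), and it assumes each record's `image` field is a path relative to `image_base_path`.

```python
import json
from pathlib import Path


def find_missing_images(data_path, image_base_path):
    """Return the image paths referenced in the JSONL file that are absent on disk."""
    base = Path(image_base_path)
    missing = []
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: "image" is a path relative to image_base_path.
            if not (base / record["image"]).is_file():
                missing.append(record["image"])
    return missing
```

An empty list from `find_missing_images(data_path, image_base_path)` suggests the two paths are configured consistently; a non-empty list usually points to an incomplete extraction or a wrong base directory.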
## Dataset Structure
The `train.jsonl` file contains the question-answering data in the following format:
```json
{
"id": ID,
"image": IMAGE_PATH,
"conversations": [{"from": "human", "value": QUESTION},{"from": "gpt", "value": ANSWER}]
}
```
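As an illustration of this format, the sketch below parses one record and pairs each question with its answer. The sample record is made up to match the documented shape, and `qa_pairs` is a hypothetical helper name.

```python
import json

# Made-up record in the documented shape; the values are placeholders, not real data.
sample_line = (
    '{"id": "demo-0", "image": "demo/0.jpg", "conversations": ['
    '{"from": "human", "value": "What is shown?"}, '
    '{"from": "gpt", "value": "A cat."}]}'
)


def qa_pairs(record):
    """Pair each "human" turn with the "gpt" turn that immediately follows it."""
    turns = record["conversations"]
    pairs = []
    for human, gpt in zip(turns, turns[1:]):
        if human["from"] == "human" and gpt["from"] == "gpt":
            pairs.append((human["value"], gpt["value"]))
    return pairs


record = json.loads(sample_line)
print(record["id"], record["image"], qa_pairs(record))
```

Pairing adjacent turns rather than indexing positions 0 and 1 keeps the helper usable if a record ever holds more than one question-answer exchange.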
## Dataset Creation
We utilized images and questions from open-source datasets. The distribution is as follows:
| **Dataset** | **Type** | **Size** |
|---------------------|------------------------|-----------|
| ShareGPT4V | General VQA | 31.3k |
| ChartQA | General VQA | 17.2k |
| A-OKVQA | General VQA | 16.1k |
| AI2D | Science-Targeted VQA | 11.4k |
| GeoQA+ | Science-Targeted VQA | 11.4k |
| ScienceQA | Science-Targeted VQA | 5.6k |
| DocVQA | General VQA | 4.0k |
| PISC | General VQA | 1.0k |
| CLEVR | General VQA | 0.5k |
| CLEVR-Math | Science-Targeted VQA | 0.5k |
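As a quick check on the table, the per-type and overall totals can be computed directly. The sizes below are copied from the table (in thousands); the variable and function names are illustrative only.

```python
from collections import defaultdict

# (dataset, type, size in thousands), copied from the table above
SOURCES = [
    ("ShareGPT4V", "General VQA", 31.3),
    ("ChartQA", "General VQA", 17.2),
    ("A-OKVQA", "General VQA", 16.1),
    ("AI2D", "Science-Targeted VQA", 11.4),
    ("GeoQA+", "Science-Targeted VQA", 11.4),
    ("ScienceQA", "Science-Targeted VQA", 5.6),
    ("DocVQA", "General VQA", 4.0),
    ("PISC", "General VQA", 1.0),
    ("CLEVR", "General VQA", 0.5),
    ("CLEVR-Math", "Science-Targeted VQA", 0.5),
]


def size_by_type(sources):
    """Sum the sizes per question type, rounded to one decimal place."""
    totals = defaultdict(float)
    for _, kind, size_k in sources:
        totals[kind] += size_k
    return {kind: round(total, 1) for kind, total in totals.items()}


totals = size_by_type(SOURCES)
grand_total = round(sum(totals.values()), 1)
print(totals, grand_total)
```

With the rounded figures in the table, General VQA accounts for 70.1k samples and Science-Targeted VQA for 28.9k, for a total of 99.0k.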
Additionally, we used GPT-4o to generate structured answers. For details on the generation process, refer to [dataset_generation/generate.py](https://github.com/PKU-YuanGroup/LLaVA-CoT/blob/main/dataset_generation/generate.py).
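To work with the structured answers programmatically, the sketch below splits an answer into its reasoning stages. It rests on an assumption not stated in this card: that answers wrap each stage in `<SUMMARY>`, `<CAPTION>`, `<REASONING>`, and `<CONCLUSION>` tags, as described in the LLaVA-CoT paper. Check a few real records before relying on it; `split_stages` is a hypothetical helper name.

```python
import re

# Assumption: stage tag names follow the LLaVA-CoT paper; verify against real data.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def split_stages(answer):
    """Return {stage: text} for each <STAGE>...</STAGE> block found in the answer."""
    stages = {}
    for name in STAGES:
        match = re.search(rf"<{name}>(.*?)</{name}>", answer, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages
```

Stages that are absent from an answer are simply left out of the result, so a record that does not follow the assumed tagging yields an empty dictionary rather than an error.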
## Bias, Risks, and Limitations
We have provided the sources of the images to the best of our ability. If you believe there is any infringement, please contact us immediately. We will remove the dataset and reference the provided links instead.
The training images and questions are sourced from open datasets, and the answers are generated by GPT-4o. Despite our efforts to ensure diversity, some biases may still exist.
## Citation
```bibtex
@misc{xu2024llavacot,
title={LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},
author={Guowei Xu and Peng Jin and Hao Li and Yibing Song and Lichao Sun and Li Yuan},
year={2024},
eprint={2411.10440},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.10440},
}
```
Provider: maas
Created: 2024-12-03



