Strongich/LLaVA-CoT-filtered
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Strongich/LLaVA-CoT-filtered
Resource description:
---
pretty_name: LLaVA-CoT-filtered
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- visual-question-answering
tags:
- multimodal
- reasoning
- distillation
- image
- text
- datasets
---
# Dataset card for LLaVA-CoT-filtered
`Strongich/LLaVA-CoT-filtered` is a filtered version of [Xkev/LLaVA-CoT-100k](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). The goal is to provide a cleaner multimodal reasoning dataset for training and distillation, with leakage-contaminated traces removed before release.
Each row keeps the original example `id` and `image`, and preserves the original multi-turn `conversations` structure after normalization. The exported `conversations` field renames speaker roles from `human`/`gpt` to `user`/`assistant`, removes leading `<SUMMARY>...</SUMMARY>` content from assistant turns, and strips structural reasoning tags while keeping their inner text.
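A minimal sketch of loading the dataset and inspecting one row is shown here. It assumes the Hugging Face `datasets` library is installed, that the single split is named `train` (inferred from the `train.parquet` export), and that each turn keeps the upstream `from`/`value` keys. The card only states that the role values are renamed, so the key names are an assumption. `load_filtered` and `summarize_row` are hypothetical helper names.

```python
def load_filtered(split="train"):
    """Load the dataset from the Hub (needs network access and `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so summarize_row works without it
    return load_dataset("Strongich/LLaVA-CoT-filtered", split=split)


def summarize_row(row, role_key="from"):
    """Return the id, turn count, and speaker sequence of one row.

    `role_key="from"` is an assumption about the exported schema.
    """
    turns = row["conversations"]
    return {
        "id": row["id"],
        "num_turns": len(turns),
        "roles": [turn[role_key] for turn in turns],
    }
```

With network access, `summarize_row(load_filtered()[0])` would print the speaker sequence of the first example. If the exported turns use different keys, pass the actual role key instead.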
## Creation Process
### 1. Load the original LLaVA-CoT dataset
The dataset is built from [Xkev/LLaVA-CoT-100k](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). The original examples are loaded together with their corresponding images before filtering and export.
### 2. Remove leakage-contaminated records
A record is removed if any of its assistant turns contains one of the following leakage phrases after cleaning:
- `standard answer`
- `correct answer`
- `reference answer`
- `given answer`
- `provided answer`
Training on chain-of-thought traces that contain these phrases caused small VLMs to hallucinate around the leakage pattern: they often repeated or anchored on those words instead of continuing useful reasoning. In practice, this pushed generation into "doom looping" behavior and steered the model away from the correct answer.
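The removal rule can be sketched as a simple substring check over assistant turns. This is an illustration, not the released generation code. Case-insensitive matching, the `from`/`value` turn keys, and the helper names `has_leakage` and `drop_leaky` are all assumptions.

```python
# Phrases listed in the card.
LEAKAGE_PHRASES = (
    "standard answer",
    "correct answer",
    "reference answer",
    "given answer",
    "provided answer",
)


def has_leakage(record, role_key="from", text_key="value"):
    """Return True if any assistant turn contains one of the leakage phrases.

    Accepts both the original (`gpt`) and normalized (`assistant`) role names.
    Lower-casing before matching is an assumption, not stated in the card.
    """
    for turn in record["conversations"]:
        if turn.get(role_key) not in ("gpt", "assistant"):
            continue  # user turns are not checked
        text = turn.get(text_key, "").lower()
        if any(phrase in text for phrase in LEAKAGE_PHRASES):
            return True
    return False


def drop_leaky(records):
    """Keep only the records with no leakage phrase in any assistant turn."""
    return [record for record in records if not has_leakage(record)]
```

Note that the whole record is dropped, not just the offending turn, which matches the record-level wording of the rule.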
### 3. Normalize the original conversations
For each retained example, the original `conversations` field is normalized while preserving the multi-turn structure:
- `human` is renamed to `user`
- `gpt` is renamed to `assistant`
- leading `<SUMMARY>...</SUMMARY>` content is removed from assistant turns
- the structural tags `<CAPTION>`, `<REASONING>`, and `<CONCLUSION>` are stripped while their inner text is kept
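The four normalization rules above can be sketched with two regular expressions. This is a hedged illustration rather than the actual pipeline. It assumes turns use the upstream `from`/`value` keys, applies tag stripping to assistant turns only, and trims surrounding whitespace; none of these details are confirmed by the card. `normalize_turn` and `normalize_record` are hypothetical names.

```python
import re

ROLE_MAP = {"human": "user", "gpt": "assistant"}

# A <SUMMARY>...</SUMMARY> block at the very start of a turn, plus trailing whitespace.
LEADING_SUMMARY = re.compile(r"^\s*<SUMMARY>.*?</SUMMARY>\s*", re.DOTALL)

# Opening and closing structural tags; the text between them is left in place.
STRUCTURAL_TAGS = re.compile(r"</?(?:CAPTION|REASONING|CONCLUSION)>")


def normalize_turn(turn):
    """Rename the speaker role and clean the text of one conversation turn."""
    role = ROLE_MAP.get(turn["from"], turn["from"])
    text = turn["value"]
    if role == "assistant":
        text = LEADING_SUMMARY.sub("", text)   # drop the leading summary entirely
        text = STRUCTURAL_TAGS.sub("", text)   # drop the tags, keep their inner text
    return {"from": role, "value": text.strip()}  # stripping whitespace is an assumption


def normalize_record(record):
    """Return a copy of the record with every turn normalized, order preserved."""
    out = dict(record)
    out["conversations"] = [normalize_turn(t) for t in record["conversations"]]
    return out
```

Because the summary pattern is anchored at the start of the text, a `<SUMMARY>` block appearing later in a turn would be left untouched, in line with the "leading" qualifier.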
### 4. Export the cleaned dataset
The final dataset preserves the original `id` field and the single `image` column, drops the redundant `images` column, and is exported in Parquet format as `train.parquet`.
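The export step can be sketched as a column selection followed by a Parquet write. This is only an illustration under stated assumptions. It assumes pandas with a Parquet engine such as pyarrow is available for the write, and it does not cover how image data is encoded. `to_export_row` and `write_parquet` are hypothetical helpers.

```python
# Columns the card says are exported; `images` is intentionally absent.
KEPT_COLUMNS = ("id", "image", "conversations")


def to_export_row(record):
    """Keep only the exported columns, dropping `images` and anything else."""
    return {name: record[name] for name in KEPT_COLUMNS}


def write_parquet(rows, path="train.parquet"):
    """Write rows to a Parquet file (needs pandas and a Parquet engine)."""
    import pandas as pd  # imported lazily so to_export_row works without pandas
    pd.DataFrame(rows).to_parquet(path, index=False)
```

A typical call would be `write_parquet([to_export_row(r) for r in records])` once the records have been filtered and normalized.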
## Intended Use
This dataset is meant for training or distilling smaller multimodal models on cleaner reasoning traces derived from LLaVA-CoT. In particular, it is intended for setups where leakage-heavy chain-of-thought traces would otherwise teach the student model to repeat harmful answer-reference patterns instead of reasoning toward the correct answer.
## Code Availability
The dataset generation code is not included in this release. It will be released later.
Provider:
Strongich



