Strongich/LLaVA-CoT-filtered

Name: Strongich/LLaVA-CoT-filtered
Creator: Strongich
Published: 2026-04-15 14:34:30
License: apache-2.0

Source: Hugging Face. Updated: 2026-04-15. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Strongich/LLaVA-CoT-filtered

Description:

---
pretty_name: LLaVA-CoT-filtered
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- visual-question-answering
tags:
- multimodal
- reasoning
- distillation
- image
- text
- datasets
---

# Dataset card for LLaVA-CoT-filtered

`Strongich/LLaVA-CoT-filtered` is a filtered version of [Xkev/LLaVA-CoT-100k](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). The goal is to provide a cleaner multimodal reasoning dataset for training and distillation, with leakage-contaminated traces removed before release.

Each row keeps the original example `id` and `image`, and preserves the original multi-turn `conversations` structure after normalization. The exported `conversations` field renames speaker roles from `human`/`gpt` to `user`/`assistant`, removes leading `<SUMMARY>...</SUMMARY>` content from assistant turns, and strips structural reasoning tags while keeping their inner text.

## Creation Process

### 1. Load the original LLaVA-CoT dataset

The dataset is built from [Xkev/LLaVA-CoT-100k](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). The original examples are loaded together with their corresponding images before filtering and export.

### 2. Remove leakage-contaminated records

Any record is removed if any assistant turn contains one of the following leakage phrases after cleaning:

- `standard answer`
- `correct answer`
- `reference answer`
- `given answer`
- `provided answer`

Learning chain-of-thought reasoning from traces containing these phrases caused small VLMs to hallucinate around this leakage pattern, often repeating or anchoring on those words instead of continuing useful reasoning. In practice, this pushed generation into "doom looping" behavior and moved the model away from the correct answer.

### 3. Normalize the original conversations

For each retained example, the original `conversations` field is normalized while preserving the multi-turn structure:

- `human` is renamed to `user`
- `gpt` is renamed to `assistant`
- leading `<SUMMARY>...</SUMMARY>` content is removed from assistant turns
- the structural tags `<CAPTION>`, `<REASONING>`, and `<CONCLUSION>` are stripped while their inner text is kept

### 4. Export the cleaned dataset

The final dataset preserves the original `id` field and the single `image` column, drops the redundant `images` column, and is exported in Parquet format as `train.parquet`.

## Intended Use

This dataset is meant for training or distilling smaller multimodal models on cleaner reasoning traces derived from LLaVA-CoT. In particular, it is intended for setups where leakage-heavy chain-of-thought traces would otherwise teach the student model to repeat harmful answer-reference patterns instead of reasoning toward the correct answer.

## Code Availability

The dataset generation code is not included in this release. It will be released later.
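
Since the generation code is not part of this release, the following is only a hypothetical Python sketch of the cleaning and leakage-filtering steps as the card describes them. It is not the author's actual pipeline. It assumes each turn is a dict with `from` and `value` keys (the LLaVA-style layout; the exported key names may differ), that phrase matching is case-insensitive, and that surrounding whitespace is trimmed after tags are stripped. The helper names `clean_assistant_text`, `normalize_conversations`, `has_leakage`, and `filter_and_normalize` are made up for illustration.

```python
import re

# Leakage phrases listed in the card.
LEAKAGE_PHRASES = (
    "standard answer",
    "correct answer",
    "reference answer",
    "given answer",
    "provided answer",
)

# Speaker-role renaming described in the card.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

# Assumption: "leading" means the SUMMARY block opens the turn (ignoring whitespace).
LEADING_SUMMARY = re.compile(r"^\s*<SUMMARY>.*?</SUMMARY>\s*", re.DOTALL)
# Matches opening and closing structural tags only, so their inner text is kept.
STRUCTURAL_TAGS = re.compile(r"</?(?:CAPTION|REASONING|CONCLUSION)>")


def clean_assistant_text(text: str) -> str:
    """Drop a leading SUMMARY block, then strip structural tags."""
    text = LEADING_SUMMARY.sub("", text, count=1)
    text = STRUCTURAL_TAGS.sub("", text)
    return text.strip()  # assumption: trim whitespace left at the ends


def normalize_conversations(conversations):
    """Rename roles and clean assistant turns, keeping the turn order."""
    normalized = []
    for turn in conversations:
        role = ROLE_MAP.get(turn["from"], turn["from"])
        text = turn["value"]
        if role == "assistant":
            text = clean_assistant_text(text)
        normalized.append({**turn, "from": role, "value": text})
    return normalized


def has_leakage(conversations) -> bool:
    """True if any cleaned assistant turn contains a leakage phrase."""
    return any(
        turn["from"] == "assistant"
        and any(phrase in turn["value"].lower() for phrase in LEAKAGE_PHRASES)
        for turn in conversations
    )


def filter_and_normalize(records):
    """Clean each record, then drop it if the cleaned text still leaks."""
    kept = []
    for record in records:
        conversations = normalize_conversations(record["conversations"])
        if has_leakage(conversations):
            continue
        kept.append(
            {
                "id": record["id"],
                "image": record["image"],
                "conversations": conversations,
            }
        )
    return kept
```

Cleaning runs before the phrase check because the card states that records are removed based on assistant text "after cleaning". Only assistant turns are inspected, so a user question that happens to mention one of the phrases does not cause a record to be dropped.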

Provided by:

Strongich
