five

alibaba-pai/OmniThoughtV_Filter_0.5M

收藏
Hugging Face2026-01-26 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/alibaba-pai/OmniThoughtV_Filter_0.5M
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 language: - en --- ## Dataset Introduction OmniThoughtV is a large-scale multimodal long-chain-of-thought dataset distilled from the [FineVision](https://huggingface.co/datasets/HuggingFaceM4/FineVision) dataset using Alibaba Cloud's AI platform (PAI) distillation toolkit, [EasyDistill](https://github.com/modelscope/easydistill). This dataset establishes a transparent and reproducible data distillation pipeline, enabling efficient construction of multimodal reasoning chains of thought. Fine-tuning smaller models with this dataset effectively endows them with stronger reasoning capabilities and enhances their performance on downstream tasks. ## Dataset Usage ```jsonc from datasets import load_dataset data_path='...YOUR_PATH_To.../OmmiThoughtV_Raw_1.8M/' OmmiThoughtV=load_dataset(data_path, streaming=True) ``` ## Dataset Examples Dataset examples are shown below. In these examples, images are stored in base64-encoded string format. <img src="readme/sample.png" alt="描述文字" width="50%"> ```jsonc [ { "role": "system", "content": "You are a helpful assistant to think step by step. Provide your reasoning steps within <thinking></thinking> tags and give your final answer within <answer></answer> tags." }, { "role": "user", "content": “ ### Question What is the currency used in the invoice? ### Output Format (Strictly Enforced) <thinking> Clearly explain your reasoning step by step. Describe how you arrived at the conclusion. The reasoning process MUST BE enclosed within <thinking> </thinking> tags. </thinking> <answer> Your final answer to the user's question. </answer> <image> ” }, { "role": "assistant", "content": "<thinking> To determine the currency used in the invoice, I will analyze the document for any indications of currency symbols or references. 1. The invoice is from the CRC Contract Research Center, and it includes prices listed next to each item. 2. Looking at the prices, such as \"5.574,-\", \"10.580,-\", \"6.968,-\", \"689,-\", and \"3.242,-\", these numbers are followed by a comma and a dash, which is typical formatting in some European countries for decimal points and currency notation. 3. The total amount at the bottom is \"24.153,-\". This format is consistent with Belgian francs (BEF), which was the currency used in Belgium before the euro. 4. The document is dated November 24, 1993, and Belgium used the Belgian franc until 2002 when it adopted the euro. 5. Additionally, the company's address is in Belgium (B-1932 Zaventem), and the phone number format is also consistent with Belgium. 6. The use of \"BP\" in the price column likely stands for \"Belgian Francs\" (Belgische Frank), which was the currency code for the Belgian franc. Given all these observations, the currency used in the invoice is the Belgian franc (BEF). </thinking> <answer> Belgian franc (BEF) </answer> “ } ] ``` ## Evaluation We fine-tuned Qwen3-VL 2B, 4B, and 8B models on a filtered version of the OmniThoughtV dataset containing 500k samples and evaluated them using the LMMs-Eval framework. Since different prompt templates can affect evaluation results, we consistently used a fixed inference prompt throughout our evaluations (see [Evaluation Details](eval/eval.md) for specifics). The experimental results are summarized in the table below. Our experiments successfully validated the effectiveness of the dataset. Results show that all fine-tuned models achieved performance improvements across various benchmarks—both general visual understanding benchmarks (e.g., AI2D, MMStar) and benchmarks more sensitive to reasoning capabilities (e.g., MMMU_Pro_standard, MMMU_Pro_vision, MathVerse, MathVision). Notably, the gains were especially pronounced on reasoning-intensive benchmarks. This clearly demonstrates that distilling high-quality data, carefully filtering it, and scaling it up effectively enhances model reasoning capabilities. <img src="readme/ovbch.png" alt="评测结果" width="100%"> ## Reference - Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang. Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations. arXiv preprint - Yuanjie Lyu, Chengyu Wang, Jun Huang, Tong Xu. From Correction to Mastery: Reinforced Distillation of Large Language Model Agents. arXiv preprint - Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang. EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models. EMNLP 2025 - Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang. Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series. EMNLP 2025 - Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang. Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment. EMNLP 2025 - Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang. DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models. ACL 2025 - Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang. Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud. COLING 2025 - Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang. Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning. EMNLP 2024

OmniThoughtV is a large-scale multimodal long-chain-of-thought dataset distilled from the FineVision dataset using Alibaba Clouds EasyDistill toolkit. This dataset establishes a transparent and reproducible data distillation pipeline, enabling efficient construction of multimodal reasoning chains of thought. Fine-tuning smaller models with this dataset effectively endows them with stronger reasoning capabilities and enhances their performance on downstream tasks. The dataset includes rich examples demonstrating step-by-step reasoning to answer questions, and its effectiveness in improving model performance has been validated through experiments.
提供机构:
alibaba-pai
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作