
kunato/ODA-Mixture-100k

Source: Hugging Face. Updated 2026-02-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/kunato/ODA-Mixture-100k
Resource description:
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: source
    dtype: string
  - name: instruction
    dtype: string
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 3840935895
    num_examples: 101306
  download_size: 1490250591
  dataset_size: 3840935895
---

# ODA-Mixture-100k

<img src="performance.png" alt="Subject Distribution" width="1200" />

ODA-Mixture-100k is a compact general-purpose post-training dataset curated from top-performing open corpora (selected via the *OpenDataArena* leaderboard) and refined through deduplication and benchmark decontamination.

---

## 🧠 Dataset Summary

- **Domain**: General-purpose (e.g., Math, Code, Reasoning, General).
- **Format**: Problem → Solution (reasoning trace) → Final answer.
- **Scale (selected training set)**: ~**100K** samples (101,306 examples).
- **Goal**: Achieve significant general-purpose performance gains across various domains (Math, Code, Reasoning, etc.) using a small-scale, curated dataset of ~100K samples.

---

## ⚙️ Data Curation Pipeline

ODA-Mixture-100k is built by following a single rule: **trust the OpenDataArena leaderboard**.

### 1️⃣ Data Collection

We chose **LIMO** as our foundation because it achieves a high ranking on the ODA overall leaderboard with very few samples. This efficiency allows us to establish a strong reasoning baseline. We then augment this core with **AM-Thinking-v1-Distilled-math** and **AM-Thinking-v1-Distilled-code**, the top-performing, efficient datasets on the ODA Math and Code leaderboards respectively, to enhance specialized domain capabilities.

### 2️⃣ Deduplication & Decontamination

We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
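The two filtering steps above can be illustrated with a minimal sketch. The card does not say how questions are normalized or how benchmark overlap is detected, so everything here is an assumption for illustration: the whitespace and case normalization, the word n-gram overlap test, the value of `n`, and the helper names `exact_dedup` and `decontaminate` are all hypothetical. The `instruction` field name follows the features declared in the front matter.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(samples: list[dict]) -> list[dict]:
    """Keep the first sample for each distinct normalized instruction."""
    seen: set[str] = set()
    kept = []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample["instruction"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(samples: list[dict], benchmark_questions: list[str], n: int = 8) -> list[dict]:
    """Drop samples that share any word n-gram with a benchmark question (assumed criterion)."""
    banned: set[tuple[str, ...]] = set()
    for question in benchmark_questions:
        banned |= word_ngrams(question, n)
    return [s for s in samples if not (word_ngrams(s["instruction"], n) & banned)]


# Toy data, made up for this example.
samples = [
    {"id": 0, "instruction": "What is 2 + 2?"},
    {"id": 1, "instruction": "what is  2 + 2?"},  # duplicate of id 0 after normalization
    {"id": 2, "instruction": "Prove that the sum of two even numbers is even."},
    {"id": 3, "instruction": "Write a function that reverses a string."},
]
benchmark = ["Prove that the sum of two even numbers is even"]

deduped = exact_dedup(samples)
clean = decontaminate(deduped, benchmark, n=5)
print([s["id"] for s in deduped], [s["id"] for s in clean])
```

Hashing the normalized text keeps memory use flat on large corpora; a stricter reading of "exact" would hash the raw string instead.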
### 3️⃣ Data Selection

To stay within our ~100K data budget while maximizing the impact of each sample, we use semantic clustering to map the overall data distribution. Within each cluster, we preferentially sample the most challenging instances, using sequence length as a practical proxy for reasoning complexity and problem difficulty.

---

## 📚 Source Composition

| Source | Count | Percentage |
|---|---:|---:|
| LIMO | 817 | 0.81% |
| AM-Thinking-Distilled-math | 50,244 | 49.59% |
| AM-Thinking-Distilled-code | 50,245 | 49.60% |

---

## 🧩 Data Format

```json
{
  "id": "unique_identifier",
  "source": "data source",
  "instruction": "textual question or instruction",
  "response": "textual response"
}
```

---

## 📈 Performance

ODA-Mixture-100k is evaluated as an SFT corpus for both **Qwen2.5-7B-Base** and **Qwen3-8B-Base**. Across the full ODA benchmark suite spanning four domains, **General (DROP, IFEVAL, AGIEVAL, MMLU-Pro)**, **Math (GSM8K, MATH500, Omni-Math, OlympiadBench, AIME2024)**, **Code (HumanEval, MBPP, LCB (V5), HumanEval+)**, and **Reasoning (ARC-C, BBH, CALM, KOR-BENCH)**, we observe consistent improvements over the corresponding base checkpoints, with the largest domain-level gains in Math and Code.

<div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;">
<table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;">
<caption style="padding: 10px; font-weight: bold;">
Leaderboard Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>. Eff. denotes Data Efficiency.
</caption>
<thead>
<tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;">
<th style="text-align: left; padding: 8px;">Dataset</th>
<th>Size</th>
<th>Eff.</th>
<th>General</th>
<th>Math</th>
<th>Code</th>
<th>Reasoning</th>
<th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th>
</tr>
</thead>
<tbody>
<!-- ================= Qwen2.5-7B-Base ================= -->
<tr style="background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
<td colspan="8" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen2.5-7B-Base</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">Qwen2.5-7B-Base</td>
<td>-</td><td>-</td>
<td>51.4</td><td>39.8</td><td>50.1</td><td>42.7</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">46.0</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">OpenThoughts3-1.2M</td>
<td>1.2M</td><td>+0.011</td>
<td>45.5</td><td>71.8</td><td><u>67.0</u></td><td>54.3</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">59.6</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">OmniThought-0528</td>
<td>365k</td><td>+0.027</td>
<td>47.1</td><td>71.2</td><td>47.6</td><td>57.2</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">55.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">SYNTHETIC-2-SFT-verified</td>
<td>105k</td><td>+0.086</td>
<td>51.3</td><td>69.8</td><td>40.1</td><td><u>58.9</u></td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">55.0</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">AM-Thinking-v1-Distilled-math</td>
<td>558k</td><td>+0.016</td>
<td>57.7</td><td><b>77.4</b></td><td>39.5</td><td>44.8</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">LIMO</td>
<td>817</td><td><b>+9.920</b></td>
<td><u>60.7</u></td><td>44.0</td><td>57.9</td><td>53.8</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.1</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">MiroMind-M1-SFT-719K</td>
<td>719k</td><td>+0.006</td>
<td>52.0</td><td>71.0</td><td>26.3</td><td>51.5</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">50.2</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">AM-Thinking-v1-Distilled-code</td>
<td>324k</td><td>+0.024</td>
<td>49.9</td><td>52.3</td><td><b>68.7</b></td><td>44.4</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">Light-R1-SFTData</td>
<td>79k</td><td>+0.084</td>
<td>55.5</td><td>64.4</td><td>38.8</td><td>51.9</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">52.7</td>
</tr>
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold;">
<td style="text-align: left; padding: 8px;">ODA-Mixture-500k</td>
<td>500k</td><td>+0.039</td>
<td><b>63.4</b></td><td><u>72.8</u></td><td>66.7</td><td><b>59.6</b></td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>65.6</b></td>
</tr>
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold;">
<td style="text-align: left; padding: 8px;">ODA-Mixture-100k</td>
<td>100k</td><td><u>+0.149</u></td>
<td>56.8</td><td>71.2</td><td>64.4</td><td>51.5</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>61.0</u></td>
</tr>
<!-- ================= Qwen3-8B-Base ================= -->
<tr style="border-top: 1px solid currentColor; background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
<td colspan="8" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen3-8B-Base</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">Qwen3-8B-Base</td>
<td>-</td><td>-</td>
<td>58.7</td><td>51.2</td><td>52.4</td><td>50.6</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.2</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">MiroMind-M1-SFT-719K</td>
<td>719k</td><td>+0.023</td>
<td>64.5</td><td>77.2</td><td>63.6</td><td>65.8</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">67.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">AM-Thinking-v1-Distilled-math</td>
<td>558k</td><td>+0.028</td>
<td><u>65.9</u></td><td><b>79.7</b></td><td>59.5</td><td>63.2</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">67.1</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">OmniThought-0528</td>
<td>365k</td><td>+0.043</td>
<td>55.8</td><td><u>78.3</u></td><td>68.1</td><td>66.0</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">67.0</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">AM-Thinking-v1-Distilled-code</td>
<td>324k</td><td>+0.045</td>
<td>64.8</td><td>64.9</td><td><b>75.8</b></td><td>59.3</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">66.2</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">Light-R1-SFTData</td>
<td>79k</td><td>+0.168</td>
<td>64.9</td><td>71.8</td><td>59.0</td><td>63.6</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">64.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">SYNTHETIC-2-SFT-verified</td>
<td>105k</td><td>+0.107</td>
<td>59.5</td><td>75.4</td><td>56.1</td><td><u>66.6</u></td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">64.4</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">LIMO</td>
<td>817</td><td><b>+0.490</b></td>
<td>61.7</td><td>46.0</td><td>52.7</td><td>54.1</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.6</td>
</tr>
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold;">
<td style="text-align: left; padding: 8px;">ODA-Mixture-500k</td>
<td>500k</td><td>+0.042</td>
<td><b>71.2</b></td><td>77.2</td><td>73.0</td><td><b>69.7</b></td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>72.8</b></td>
</tr>
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid
currentColor;">
<td style="text-align: left; padding: 8px;">ODA-Mixture-100k</td>
<td>100k</td><td><u>+0.177</u></td>
<td>61.1</td><td>77.3</td><td><u>73.2</u></td><td>64.7</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>69.0</u></td>
</tr>
</tbody>
</table>
</div>

---

## 🌐 About OpenDataArena

[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

**Key Features:**

- 🏆 **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
- 📊 **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, etc.
- 🧰 **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.

If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.

---

## 📚 Citation

```bibtex
@article{gao2025closing,
  title={Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets},
  author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Cai, Mengzhang and He, Conghui and Wu, Lijun},
  journal={arXiv preprint arXiv:2601.09733},
  year={2025}
}
```

```bibtex
@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}
```
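The performance table labels its Eff. column only as "Data Efficiency" and gives no formula. The values in the Qwen2.5-7B-Base block appear consistent with the gain in AVG over the base model per 1,000 training samples. That reading is inferred from the numbers and is not stated in the card. Several Qwen3-8B-Base entries differ from it by more than rounding, so treat it as an approximation. The sketch below checks a few rows under this assumption. The function name `data_efficiency` is hypothetical, and sample counts are taken from the rounded Size column.

```python
def data_efficiency(avg_score: float, base_avg: float, num_samples: int) -> float:
    """Assumed definition: AVG gain over the base model per 1,000 training samples."""
    return (avg_score - base_avg) / (num_samples / 1000)


BASE_AVG = 46.0  # AVG of Qwen2.5-7B-Base in the table

# name: (AVG, sample count from the Size column, Eff. as reported in the table)
rows = {
    "OpenThoughts3-1.2M": (59.6, 1_200_000, 0.011),
    "SYNTHETIC-2-SFT-verified": (55.0, 105_000, 0.086),
    "ODA-Mixture-500k": (65.6, 500_000, 0.039),
    "LIMO": (54.1, 817, 9.920),
}

computed = {}
for name, (avg, size, reported) in rows.items():
    computed[name] = data_efficiency(avg, BASE_AVG, size)
    print(f"{name}: computed {computed[name]:+.3f}, reported {reported:+.3f}")
```

Small differences in the last digit are expected because the AVG and Size columns are themselves rounded.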
Provider:
kunato