OpenDataArena/ODA-Mixture-100k

Name: OpenDataArena/ODA-Mixture-100k
Creator: OpenDataArena
Published: 2026-01-21 02:52:26
License: Not specified

Source: Hugging Face
Updated: 2026-01-21
Indexed: 2026-01-03

Download link:

https://hf-mirror.com/datasets/OpenDataArena/ODA-Mixture-100k
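One possible way to fetch the dataset through the mirror above is sketched below. It assumes the third-party `datasets` library is installed and that the repository exposes a `train` split; the split name and the helper `load_oda_mixture` are illustrative assumptions, not something stated on this page. The `HF_ENDPOINT` environment variable is read when the Hugging Face libraries are first imported, so the sketch sets it before importing `datasets`.

```python
import os

REPO_ID = "OpenDataArena/ODA-Mixture-100k"
MIRROR_ENDPOINT = "https://hf-mirror.com"


def load_oda_mixture(split="train", use_mirror=True):
    """Hypothetical helper: load the dataset, optionally via the mirror.

    The split name "train" is an assumption; check the dataset page for
    the splits that actually exist.
    """
    if use_mirror:
        # Only takes effect if the Hugging Face libraries have not been
        # imported yet in this process.
        os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)

    # Imported lazily so the endpoint is set before the library reads it.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_oda_mixture()` requires network access and downloads the data on first use.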

Description:

ODA-Mixture-100k is a compact, general-purpose post-training dataset curated from top-performing open corpora (selected via the OpenDataArena leaderboard) and refined through deduplication and benchmark decontamination. It spans multiple domains (e.g., Math, Code, Reasoning, General), and each sample follows a Problem → Solution (reasoning trace) → Final answer format. The curated training set contains about 100K samples; the goal is to achieve significant general-purpose gains across these domains with a small, carefully selected dataset. The curation pipeline has three stages: data collection (based on LIMO, AM-Thinking-v1-Distilled-math, and AM-Thinking-v1-Distilled-code), deduplication and decontamination, and semantic clustering to select the most challenging instances. The data is stored as JSON, and each record includes a unique identifier, the data source, the question text, and the response text. Evaluations show consistent improvements on both Qwen2.5-7B-Base and Qwen3-8B-Base, with particularly strong gains in the Math and Code domains.
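To make the deduplication and decontamination stage more concrete, here is a minimal sketch of one common approach: exact-match deduplication on normalized question text, followed by removal of any sample that shares a word n-gram with a benchmark question. This is a generic illustration, not the pipeline the dataset authors used, which this page does not describe. The field names `id`, `source`, `question`, and `response`, the sample records, and the n-gram size are all assumptions.

```python
import re


def normalize(text):
    """Lowercase and keep only word tokens so trivial variants compare equal."""
    return " ".join(re.findall(r"\w+", text.lower()))


def ngrams(text, n):
    """Return the set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dedup_and_decontaminate(records, benchmark_questions, n=8):
    """Drop duplicate questions, then drop questions overlapping a benchmark.

    `records` is a list of dicts with a hypothetical "question" field.
    A record counts as contaminated if it shares at least one word
    n-gram with any benchmark question.
    """
    benchmark_grams = set()
    for question in benchmark_questions:
        benchmark_grams |= ngrams(question, n)

    seen = set()
    kept = []
    for record in records:
        key = normalize(record["question"])
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        if ngrams(record["question"], n) & benchmark_grams:
            continue  # overlaps a benchmark question
        kept.append(record)
    return kept


# Made-up records using the assumed field names.
records = [
    {"id": "ex-1", "source": "LIMO",
     "question": "What is 2 + 2?", "response": "2 + 2 = 4. Final answer: 4"},
    {"id": "ex-2", "source": "LIMO",
     "question": "what is 2 + 2 ?", "response": "Final answer: 4"},
    {"id": "ex-3", "source": "AM-Thinking-v1-Distilled-code",
     "question": "Write a function that reverses a string.",
     "response": "def rev(s): return s[::-1]"},
]
benchmark = ["Write a function that reverses a string."]

clean = dedup_and_decontaminate(records, benchmark, n=3)
print([r["id"] for r in clean])  # ['ex-1']
```

In this toy run, `ex-2` is dropped as a duplicate of `ex-1` and `ex-3` is dropped because it overlaps the benchmark question. Real pipelines often add fuzzy or embedding-based matching on top of exact n-gram checks.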

Provider:

OpenDataArena
