OpenDataArena/ODA-Mixture-100k
Source: Hugging Face | Updated: 2026-01-21 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/OpenDataArena/ODA-Mixture-100k
Description:
ODA-Mixture-100k is a compact, general-purpose post-training dataset curated from top-performing open corpora (selected via the OpenDataArena leaderboard) and refined through deduplication and benchmark decontamination. It spans multiple domains (e.g., Math, Code, Reasoning, General), and each sample follows a Problem → Solution (reasoning trace) → Final answer format. The selected training set contains ~100K samples; the goal is to achieve significant general-purpose gains across these domains with a small, curated dataset. The curation pipeline has three stages: data collection (based on LIMO, AM-Thinking-v1-Distilled-math, and AM-Thinking-v1-Distilled-code), deduplication and decontamination, and semantic clustering to select the most challenging instances. The data is stored as JSON; each record contains a unique identifier, the data source, the question text, and the response text. Evaluations show consistent improvements when training Qwen2.5-7B-Base and Qwen3-8B-Base on this dataset, with particularly strong gains in the Math and Code domains.
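To make the deduplication and decontamination stage concrete, here is a minimal Python sketch. It is not the dataset authors' pipeline: the description does not say how duplicates or benchmark overlap are detected, so exact matching on normalized text and a 5-gram overlap check are assumptions chosen for illustration. The field names `id`, `source`, `question`, and `response` are also hypothetical, inferred from the four fields the description lists, and the sample records and benchmark question are made up.

```python
import json
import re

# Made-up records in an assumed JSON Lines layout (field names are hypothetical).
SAMPLE_JSONL = """
{"id": "a1", "source": "LIMO", "question": "What is 2 + 2?", "response": "2 + 2 = 4. Final answer: 4"}
{"id": "a2", "source": "LIMO", "question": "what is 2 + 2 ?", "response": "Final answer: 4"}
{"id": "a3", "source": "AM-Thinking-v1-Distilled-code", "question": "Write a function that returns the sum of a list of integers.", "response": "def total(xs): return sum(xs)"}
{"id": "a4", "source": "AM-Thinking-v1-Distilled-math", "question": "Find the number of positive divisors of 360 and explain why.", "response": "360 = 2^3 * 3^2 * 5. Final answer: 24"}
""".strip()


def normalize(text: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(text: str, n: int = 5) -> set:
    """Return the set of word n-grams of the normalized text (empty if too short)."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dedup_and_decontaminate(records, benchmark_questions, n=5):
    """Drop repeated questions, then drop questions that overlap a benchmark.

    Assumption: a record counts as contaminated if it shares at least one
    word n-gram with any benchmark question.
    """
    benchmark_ngrams = set()
    for question in benchmark_questions:
        benchmark_ngrams |= ngrams(question, n)

    seen = set()
    kept = []
    for record in records:
        key = normalize(record["question"])
        if key in seen:  # exact duplicate after normalization
            continue
        seen.add(key)
        if ngrams(record["question"], n) & benchmark_ngrams:  # benchmark overlap
            continue
        kept.append(record)
    return kept


records = [json.loads(line) for line in SAMPLE_JSONL.splitlines()]
benchmark = ["Find the number of positive divisors of 360."]  # made-up benchmark item
kept = dedup_and_decontaminate(records, benchmark)
print([record["id"] for record in kept])  # prints ['a1', 'a3']
```

In this toy run, `a2` is dropped as a duplicate of `a1` and `a4` is dropped because it shares 5-grams with the benchmark question. A real pipeline would likely use fuzzy or embedding-based matching, and the semantic clustering stage is not covered here.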
Provided by:
OpenDataArena



