OpenDCAI/Infinity-Instruct-Curated-1M

Name: OpenDCAI/Infinity-Instruct-Curated-1M
Creator: OpenDCAI
Published: 2026-03-16 12:54:31
License: No description provided

Hugging Face · Updated 2026-03-16 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/OpenDCAI/Infinity-Instruct-Curated-1M

Resource description:

---
license: mit
task_categories:
- question-answering
size_categories:
- 100K<n<1M
---

# Infinity-Instruct-Curated-1M

Beyond the sheer volume of instruction data, its quality is of equal importance. As a step in this direction, we have used the [Dataflow](https://github.com/OpenDCAI/DataFlow/tree/main) framework to sample the [infinity-instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset, producing a more compact subset with a focus on quality. We hope this curated data might be helpful for establishing baselines in instruction tuning and for future data mixture experiments.

## **Performance on Math Benchmarks**

| **Model** | Math | GSM8K | AMC23 | AIME24 | Minerva | Gaokao | Olympiad | Math-Avg |
| ---------------------- | :--: | :-------: | :---: | :----: | :-----: | :----: | :------: | :------: |
| Qwen2.5-7B-Base | 62.8 | 67.1 | 45.0 | 10.0 | 17.6 | 27.5 | 29.6 | 37.1 |
| +Infinity-Instruct-Curated-1M | 51.1 | 81.6 | 30.0 | 0.0 | 24.6 | 27.5 | 19.6 | 33.5 |
| +Infinity-Instruct-3M | 51.6 | 82.7 | 30.0 | 0.0 | 22.4 | 26.4 | 22.4 | 33.6 |

## **Performance on Code and Knowledge Benchmarks**

| **Model** | Humaneval | MBPP | Code-Avg | MMLU | C-EVAL | Knowledge-Avg |
| ---------------------- | :-------: | :--: | :------: | :--: | :----: | :-----------: |
| Qwen2.5-7B-Base | 78.7 | 74.3 | 76.5 | 71.9 | 80.0 | 76.0 |
| +Infinity-Instruct-Curated-1M | 78.7 | 77.2 | 78.0 | 72.1 | 79.7 | 75.9 |
| +Infinity-Instruct-3M | 76.8 | 78.6 | 77.7 | 72.1 | 79.6 | 75.9 |

## **Overview of Infinity-Instruct-Curated-1M**

![](overview.png)

We first deduplicate the original Infinity-Instruct-3M dataset using the K-Center-Greedy algorithm, yielding a subset of 2 million instruction samples. This subset is then classified into single-turn and multi-turn dialogues. The single-turn data is scored across six dimensions: clarity, correlation, correctness, practicability, personification, and logicality. For multi-turn data, consistency serves as the primary evaluation criterion. By combining the high-scoring data from both categories, we obtain the final Infinity-Instruct-Curated-1M dataset.

## **Cite**

If you use this dataset, please cite:

```
@article{liang2025dataflow,
  title={DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI},
  author={Liang, Hao and Ma, Xiaochen and Liu, Zhou and Wong, Zhen Hao and Zhao, Zhengyang and Meng, Zimo and He, Runming and Shen, Chengyu and Cai, Qifeng and Han, Zhaoyang and others},
  journal={arXiv preprint arXiv:2512.16676},
  year={2025}
}
```
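
## **Sketch of the Selection Steps**

The two selection steps described in the overview (K-Center-Greedy deduplication, then keeping high-scoring samples) can be illustrated with a minimal Python sketch. This is not the DataFlow implementation. The function names `k_center_greedy` and `keep_high_scoring`, the 2-D toy embeddings, the Euclidean distance, the 1 to 5 score scale, the mean aggregation, and the threshold are all assumptions made for illustration; the dataset card does not specify any of them.

```python
import math


def k_center_greedy(points, k, first=0):
    """Farthest-first selection: each new pick is the point farthest
    from every point picked so far (hypothetical helper)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # Distance from every point to its nearest selected center.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        # The point farthest from the current centers becomes the next center.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected


def keep_high_scoring(samples, threshold):
    """Keep samples whose mean score reaches the threshold.
    Mean aggregation is an assumption, not taken from the card."""
    kept = []
    for sample in samples:
        scores = sample["scores"].values()
        if sum(scores) / len(scores) >= threshold:
            kept.append(sample)
    return kept


# Toy embeddings (assumed): two near-duplicate pairs plus one isolated point.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (10.0, 0.0)]
picked = k_center_greedy(embeddings, k=3)
print(picked)  # [0, 4, 3]

# Toy scores (assumed 1-5 scale): single-turn samples carry the six
# dimensions named in the overview, multi-turn samples carry consistency.
samples = [
    {"id": "single-a", "scores": {"clarity": 5, "correlation": 4, "correctness": 5,
                                  "practicability": 4, "personification": 4, "logicality": 5}},
    {"id": "single-b", "scores": {"clarity": 2, "correlation": 3, "correctness": 2,
                                  "practicability": 3, "personification": 3, "logicality": 2}},
    {"id": "multi-a", "scores": {"consistency": 4}},
]
curated = keep_high_scoring(samples, threshold=4.0)
print([s["id"] for s in curated])  # ['single-a', 'multi-a']
```

In the toy run, the greedy step keeps one point from each cluster and drops the near-duplicates at indices 1 and 2, which is the deduplication effect the overview relies on. The scoring step then merges the surviving single-turn and multi-turn samples into one list.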

Provider: OpenDCAI
