
Design-Logic-Reasoning-Book (DLR-Book), Design-Logic-Reasoning-Web (DLR-Web)

Source: arXiv | Updated: 2025-08-18 | Indexed: 2025-11-27
Download link:
https://hf-mirror.com/datasets/Attention1115/DLR-Web
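The download link above follows the Hugging Face Hub URL layout (/datasets/<owner>/<name>), so the repository ID can be read directly from it. The following is a minimal sketch of streaming a few rows through the mirror, not an official loading recipe. It assumes the third-party `datasets` package is installed and that the repository exposes a split named "train"; neither is stated on this page. The helper names `split_dataset_url` and `stream_first_rows` are hypothetical.

```python
import os
from urllib.parse import urlparse

DOWNLOAD_URL = "https://hf-mirror.com/datasets/Attention1115/DLR-Web"


def split_dataset_url(url: str) -> tuple[str, str]:
    """Split a Hub-style dataset URL into (endpoint, repo_id)."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path layout: /datasets/<owner>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a dataset URL: {url}")
    endpoint = f"{parsed.scheme}://{parsed.netloc}"
    return endpoint, "/".join(parts[1:3])


def stream_first_rows(url: str, n: int = 3, split: str = "train") -> list[dict]:
    """Stream the first n rows without downloading the full dataset.

    Assumption: the split name "train" is a guess; check the dataset
    page for the actual split and configuration names.
    """
    endpoint, repo_id = split_dataset_url(url)
    # HF_ENDPOINT redirects Hub traffic to the mirror. It must be set
    # before the Hub client libraries are imported to take effect.
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    from datasets import load_dataset  # third-party: pip install datasets

    rows = load_dataset(repo_id, split=split, streaming=True)
    return list(rows.take(n))
```

Calling `stream_first_rows(DOWNLOAD_URL)` requires network access. Streaming is used here because the dataset holds over a million questions, so inspecting a few rows first avoids a full download.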
Description:
DESIGNER is a data engineering pipeline that synthesizes challenging questions from raw text corpora such as books and web pages. Using this pipeline, we constructed two large-scale reasoning datasets, Design-Logic-Reasoning-Book (DLR-Book) and Design-Logic-Reasoning-Web (DLR-Web), which contain 3.04 million and 1.66 million challenging questions respectively. The datasets cover 75 disciplines, extending beyond the usual mathematical fields to STEM, the humanities, the social sciences, and applied and professional domains. By introducing the concept of "design logic", DESIGNER emulates the way human education experts create questions, which significantly increases the reasoning depth and diversity of the synthesized questions. Dataset analysis shows that questions generated by our method are markedly more difficult and more diverse than those in baseline datasets. Supervised fine-tuning (SFT) experiments on the Qwen3-8B-Base and Qwen3-4B-Base models validate the effectiveness of the datasets. Models trained on our data achieve significant gains in multidisciplinary reasoning and outperform models trained on existing datasets at the same data volume. Models trained on the full datasets even surpass the official Qwen3 model.
Provided by:
Alibaba Group, Nanjing University
Created:
2025-08-18