five

SmallThoughts

收藏
魔搭社区2026-01-02 更新2025-03-15 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/SmallThoughts
下载链接
链接失效反馈
官方服务:
资源简介:
# SmallThoughts <div align="center"> <img src="https://huggingface.co/datasets/SmallDoge/SmallThoughts/resolve/main/SmallThoughts.png" alt="Small-Thoughts Map" width="60%"/> </div> <div align="center"> <a href="https://discord.gg/P2yYH95N" target="_blank" style="margin: 2px;"> <img alt="Discord" src="https://img.shields.io/badge/Discord-Small%20Doges-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/> </a> <a href="https://github.com/SmallDoges/small-thoughts" target="_blank" style="margin: 2px;"> <img alt="GitHub" src="https://img.shields.io/badge/GitHub-SmallThoughts-181717?logo=github" style="display: inline-block; vertical-align: middle;"/> </a> <a href="https://github.com/SmallDoges/small-doge/blob/main/LICENSE" style="margin: 2px;"> <img alt="License" src="https://img.shields.io/badge/License-Apache--2.0-blue.svg" style="display: inline-block; vertical-align: middle;"/> </a> </div> --- Open synthetic reasoning dataset, covering math, science, code, and puzzles. To address the issue of the existing DeepSeek R1 distilled data being too long, this dataset constrains the reasoning trajectory to be more precise and concise while retaining the reflective nature. We also open-sourced the pipeline code for distilled data [here](https://github.com/SmallDoges/small-thoughts), with just one command you can generate your own dataset. ## How to use You can load the dataset with the following code: ```python import datasets dataset = datasets.load_dataset("SmallDoge/SmallThoughts") ``` If you are using [TRL](https://github.com/huggingface/trl) for model training, The `problem` and `solution` columns can be used for **GRPO** reinforcement learning, and the `messages` columns can be used for **SFT** fine-tuning, without any additional preprocessing. ## Visualization All examples, clustered by semantic similarity, can be explored in [Nomic Atlas](https://atlas.nomic.ai/data/losercheems/smallthoughts/map). <a href="https://atlas.nomic.ai/data/losercheems/smallthoughts/map"> <img src="https://huggingface.co/datasets/SmallDoge/SmallThoughts/resolve/main/small_thoughts_map.png" alt="Nomic Atlas Small-Thoughts Map" width="40%"/> </a> # License This dataset is released under the Apache-2.0 License. # Citation ```bibtex @misc{wu2025concisereasoningbiggains, title={Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting}, author={Yifan Wu and Jingze Shi and Bingheng Wu and Jiayi Zhang and Xiaotian Lin and Nan Tang and Yuyu Luo}, year={2025}, eprint={2505.19716}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2505.19716}, } ```

# SmallThoughts 数据集 <div align="center"> <img src="https://huggingface.co/datasets/SmallDoge/SmallThoughts/resolve/main/SmallThoughts.png" alt="Small-Thoughts 映射图" width="60%"/> </div> <div align="center"> <a href="https://discord.gg/P2yYH95N" target="_blank" style="margin: 2px;"> <img alt="Discord" src="https://img.shields.io/badge/Discord-Small%20Doges-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/> </a> <a href="https://github.com/SmallDoges/small-thoughts" target="_blank" style="margin: 2px;"> <img alt="GitHub" src="https://img.shields.io/badge/GitHub-SmallThoughts-181717?logo=github" style="display: inline-block; vertical-align: middle;"/> </a> <a href="https://github.com/SmallDoges/small-doge/blob/main/LICENSE" style="margin: 2px;"> <img alt="License" src="https://img.shields.io/badge/License-Apache--2.0-blue.svg" style="display: inline-block; vertical-align: middle;"/> </a> </div> --- 本数据集为开源合成推理数据集,涵盖数学、科学、代码与谜题四大领域。 针对现有DeepSeek R1蒸馏数据过长的问题,本数据集在保留推理反思特性的前提下,对推理轨迹进行了更精准、简洁的约束。 我们已在[此处](https://github.com/SmallDoges/small-thoughts)开源了蒸馏数据的处理流水线代码,仅需执行一条命令即可生成自定义数据集。 ## 使用方法 可通过如下代码加载本数据集: python import datasets dataset = datasets.load_dataset("SmallDoge/SmallThoughts") 若使用[TRL](https://github.com/huggingface/trl)进行模型训练,`problem`与`solution`列可直接用于**GRPO**强化学习,`messages`列可直接用于**SFT**微调,无需额外预处理。 ## 可视化展示 所有样本通过语义相似度聚类后,可在[Nomic Atlas](https://atlas.nomic.ai/data/losercheems/smallthoughts/map)中浏览探索。 <a href="https://atlas.nomic.ai/data/losercheems/smallthoughts/map"> <img src="https://huggingface.co/datasets/SmallDoge/SmallThoughts/resolve/main/small_thoughts_map.png" alt="Nomic Atlas Small-Thoughts 映射图" width="40%"/> </a> # 许可证 本数据集采用Apache-2.0许可证开源发布。 # 引用格式 bibtex @misc{wu2025concisereasoningbiggains, title={Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting}, author={Yifan Wu and Jingze Shi and Bingheng Wu and Jiayi Zhang and Xiaotian Lin and Nan Tang and Yuyu Luo}, year={2025}, eprint={2505.19716}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2505.19716}, }
提供机构:
maas
创建时间:
2025-03-12
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作