
Miaow-Lab/SSAE-Dataset

Source: Hugging Face. Updated: 2026-03-04. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Miaow-Lab/SSAE-Dataset
Resource description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
configs:
- config_name: gsm8k
  data_files:
  - split: train
    path: gsm8k_385K_train.json
  - split: validation
    path: gsm8k_385K_valid.json
- config_name: numina
  data_files:
  - split: train
    path: numina_859K_train.json
  - split: validation
    path: numina_859K_valid.json
- config_name: opencodeinstruct
  data_files:
  - split: train
    path: opencodeinstruct_36K_train.json
  - split: validation
    path: opencodeinstruct_36K_valid.json
tags:
- math
- code
- sparse_autoencoder
---

# Dataset Card

This is the official dataset repository for the paper **"Step-Level Sparse Autoencoder for Reasoning Process Interpretation"**.

- **Paper:** [Arxiv](https://arxiv.org/abs/2603.03031)
- **Code:** [GitHub](https://github.com/Miaow-Lab/SSAE)
- **Collection:** [HuggingFace](https://huggingface.co/collections/Miaow-Lab/ssae)

## Dataset Overview

The repository hosts three distinct datasets covering the domains of **mathematical reasoning** and **code generation**. Each subset is pre-partitioned into training and validation splits to facilitate reproducible experiments.

### 1. GSM8K (Math)

- **Description:** A high-quality dataset of grade-school math word problems.
- **Splits:** Train / Validation

### 2. Numina (Math Competition)

- **Description:** A challenging dataset derived from mathematics competitions (e.g., AIME, AMC). It focuses on complex logical reasoning and advanced problem-solving skills.
- **Splits:** Train / Validation

### 3. OpenCodeInstruct (Code)

- **Description:** A large-scale, multi-turn instruction-tuning dataset designed to enhance the code generation and reasoning capabilities of large language models.
- **Splits:** Train / Validation

## Usage

You can load a specific subset by specifying the `config_name` (second argument) in the `load_dataset` function.

```python
from datasets import load_dataset

# 1. Load the GSM8K subset
ds_gsm8k = load_dataset("Miaow-Lab/SSAE-Dataset", "gsm8k")

# 2. Load the Numina subset
ds_numina = load_dataset("Miaow-Lab/SSAE-Dataset", "numina")

# 3. Load the OpenCodeInstruct subset
ds_code = load_dataset("Miaow-Lab/SSAE-Dataset", "opencodeinstruct")
```

## Citation

If you use this dataset or the associated code in your research, please cite our paper:

```bibtex
@misc{yang2026steplevelsparseautoencoderreasoning,
  title={Step-Level Sparse Autoencoder for Reasoning Process Interpretation},
  author={Xuan Yang and Jiayu Liu and Yuhang Lai and Hao Xu and Zhenya Huang and Ning Miao},
  year={2026},
  eprint={2603.03031},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.03031},
}
```
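
## Config and Split Lookup (Illustrative)

As a complement to the Usage section, the sketch below restates the config-to-file mapping declared in the card's metadata and shows one way to load a single split. The names `DATA_FILES`, `data_file_for`, and `load_split` are hypothetical helpers introduced here for illustration and are not part of the repository. `load_split` assumes the `datasets` package is installed and that the Hub is reachable.

```python
# Config name -> split -> file name, copied from the card's YAML metadata.
DATA_FILES = {
    "gsm8k": {
        "train": "gsm8k_385K_train.json",
        "validation": "gsm8k_385K_valid.json",
    },
    "numina": {
        "train": "numina_859K_train.json",
        "validation": "numina_859K_valid.json",
    },
    "opencodeinstruct": {
        "train": "opencodeinstruct_36K_train.json",
        "validation": "opencodeinstruct_36K_valid.json",
    },
}


def data_file_for(config_name: str, split: str) -> str:
    """Return the JSON file name backing a config/split (hypothetical helper)."""
    try:
        return DATA_FILES[config_name][split]
    except KeyError as exc:
        raise ValueError(
            f"unknown config/split: {config_name!r}/{split!r}"
        ) from exc


def load_split(config_name: str, split: str):
    """Load one split of one subset (hypothetical helper).

    Requires the `datasets` package and network access; the import is
    deferred so the lookup helper above works without either.
    """
    data_file_for(config_name, split)  # fail early on an unknown config or split
    from datasets import load_dataset

    return load_dataset("Miaow-Lab/SSAE-Dataset", config_name, split=split)
```

For example, `load_split("numina", "validation")` would request only the validation split of the Numina subset, while `data_file_for("numina", "validation")` only looks up the file name and needs no network access.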
Provider:
Miaow-Lab