Miaow-Lab/SSAE-Dataset
Source: Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Miaow-Lab/SSAE-Dataset
Resource description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
configs:
- config_name: gsm8k
  data_files:
  - split: train
    path: gsm8k_385K_train.json
  - split: validation
    path: gsm8k_385K_valid.json
- config_name: numina
  data_files:
  - split: train
    path: numina_859K_train.json
  - split: validation
    path: numina_859K_valid.json
- config_name: opencodeinstruct
  data_files:
  - split: train
    path: opencodeinstruct_36K_train.json
  - split: validation
    path: opencodeinstruct_36K_valid.json
tags:
- math
- code
- sparse_autoencoder
---
# Dataset Card
This is the official dataset repository for the paper **"Step-Level Sparse Autoencoder for Reasoning Process Interpretation"**.
- **Paper:** [arXiv](https://arxiv.org/abs/2603.03031)
- **Code:** [GitHub](https://github.com/Miaow-Lab/SSAE)
- **Collection:** [HuggingFace](https://huggingface.co/collections/Miaow-Lab/ssae)
## Dataset Overview
The repository hosts three subsets covering **mathematical reasoning** and **code generation**. Each subset is pre-partitioned into training and validation splits to facilitate reproducible experiments.
### 1. GSM8K (Math)
- **Description:** A high-quality dataset of grade-school math word problems.
- **Splits:** Train / Validation
### 2. Numina (Math Competition)
- **Description:** A challenging dataset derived from mathematics competitions (e.g., AIME, AMC). It focuses on complex logical reasoning and advanced problem-solving skills.
- **Splits:** Train / Validation
### 3. OpenCodeInstruct (Code)
- **Description:** A large-scale, multi-turn instruction-tuning dataset designed to enhance the code generation and reasoning capabilities of large language models.
- **Splits:** Train / Validation
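According to the card's metadata, each subset ships as two JSON files, one per split. As a minimal sketch, that mapping can be written out and used to fetch a single raw file with `hf_hub_download` from `huggingface_hub`. The `SPLIT_FILES` table and the `data_file` and `fetch_raw` helpers are illustrative names, not part of the repository, and the internal layout of the JSON files is not documented here, so the sketch stops at returning the local path.

```python
# File names copied from the `configs` section of the card's metadata.
SPLIT_FILES = {
    "gsm8k": {
        "train": "gsm8k_385K_train.json",
        "validation": "gsm8k_385K_valid.json",
    },
    "numina": {
        "train": "numina_859K_train.json",
        "validation": "numina_859K_valid.json",
    },
    "opencodeinstruct": {
        "train": "opencodeinstruct_36K_train.json",
        "validation": "opencodeinstruct_36K_valid.json",
    },
}


def data_file(config: str, split: str) -> str:
    """Return the JSON file name for a config/split pair (hypothetical helper)."""
    try:
        return SPLIT_FILES[config][split]
    except KeyError:
        raise ValueError(f"unknown config/split: {config!r}/{split!r}") from None


def fetch_raw(config: str, split: str) -> str:
    """Download one split file and return its local cache path (needs network)."""
    # Imported lazily so the mapping above is usable without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="Miaow-Lab/SSAE-Dataset",
        filename=data_file(config, split),
        repo_type="dataset",
    )
```

Calling `fetch_raw("gsm8k", "validation")` would download only the GSM8K validation file, which can be handy for inspecting the raw records before loading a full subset.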
## Usage
To load a specific subset, pass its config name (`gsm8k`, `numina`, or `opencodeinstruct`) as the second argument (`name`) of `load_dataset`.
```python
from datasets import load_dataset
# 1. Load the GSM8K subset
ds_gsm8k = load_dataset("Miaow-Lab/SSAE-Dataset", "gsm8k")
# 2. Load the Numina subset
ds_numina = load_dataset("Miaow-Lab/SSAE-Dataset", "numina")
# 3. Load the OpenCodeInstruct subset
ds_code = load_dataset("Miaow-Lab/SSAE-Dataset", "opencodeinstruct")
```
## Citation
If you use this dataset or the associated code in your research, please cite our paper:
```bibtex
@misc{yang2026steplevelsparseautoencoderreasoning,
  title={Step-Level Sparse Autoencoder for Reasoning Process Interpretation},
  author={Xuan Yang and Jiayu Liu and Yuhang Lai and Hao Xu and Zhenya Huang and Ning Miao},
  year={2026},
  eprint={2603.03031},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.03031},
}
```
Provided by:
Miaow-Lab



