w601sxs/simplecot_subset_50k
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/w601sxs/simplecot_subset_50k
# Creating the simpleCoT 50K Subset
## Overview
The 50K subset was created from the larger simpleCoT dataset (2.2M examples) through a 4-stage pipeline that normalizes, filters, deduplicates, and stratifies the data for balanced GRPO training.
## Pipeline
### Stage 1: Normalization & Format Extraction
**Goal:** Strip dataset-specific scaffolding to produce clean (question, answer) pairs.
The raw dataset wraps questions and answers in XML-like tags:
```
prompt: "Context: ...\nquestion: <ACTUAL QUESTION>\n..."
completion: "answer: <ACTUAL ANSWER>"
```
**Processing:**
- Extract raw question text from `question: <...>` wrapper using regex
- Extract raw answer text from `answer: <...>` wrapper
- Drop malformed examples (missing tags or too short)
- Minimum length filter: question >10 chars, answer >5 chars
**Output:** ~2.1M → ~1.8M examples with columns `prompt` (question) and `completion` (answer)
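The extraction step above can be sketched roughly as follows. This is an illustration only, not the code in `subset_data.py`: the regular expressions and the helper name `extract_pair` are hypothetical, and the sketch assumes that the angle brackets in the format above are placeholders and that the question sits on a single line after the `question:` label.

```python
import re

# Hypothetical patterns; the exact regexes used by the pipeline are not given.
QUESTION_RE = re.compile(r"^question:\s*(.+)$", re.MULTILINE)
ANSWER_RE = re.compile(r"^answer:\s*(.+)", re.DOTALL)

MIN_QUESTION_CHARS = 10  # question must be longer than this
MIN_ANSWER_CHARS = 5     # answer must be longer than this


def extract_pair(prompt, completion):
    """Return a cleaned {"prompt", "completion"} dict, or None if malformed."""
    q_match = QUESTION_RE.search(prompt)
    a_match = ANSWER_RE.search(completion.strip())
    if q_match is None or a_match is None:
        return None  # a required label is missing
    question = q_match.group(1).strip()
    answer = a_match.group(1).strip()
    if len(question) <= MIN_QUESTION_CHARS or len(answer) <= MIN_ANSWER_CHARS:
        return None  # too short to be a useful training example
    return {"prompt": question, "completion": answer}
```

Applied to each raw row, a function like this returns either a clean pair or `None`, and rows that return `None` are dropped.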
### Stage 2: Length Filtering
**Goal:** Keep examples within a reasonable token length range to ensure quality and training stability.
**Filters:**
- Question: 20–300 tokens (using Qwen2.5-0.5B tokenizer)
- Answer: 10–200 tokens
This removes trivially short questions and very long answers, which are hard to grade with ROUGE-L.
**Output:** ~1.8M → ~1.4M examples
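A minimal sketch of the length filter is shown below. The bounds come from the list above and are assumed to be inclusive. The function name `within_length` is hypothetical, and `whitespace_count` is only a stand-in so the example runs without downloading a tokenizer; the Qwen2.5-0.5B tokenizer will give different counts.

```python
MIN_Q_TOKENS, MAX_Q_TOKENS = 20, 300
MIN_A_TOKENS, MAX_A_TOKENS = 10, 200


def within_length(example, count_tokens):
    """Return True if both fields fall inside the token ranges above."""
    q_len = count_tokens(example["prompt"])
    a_len = count_tokens(example["completion"])
    return (MIN_Q_TOKENS <= q_len <= MAX_Q_TOKENS
            and MIN_A_TOKENS <= a_len <= MAX_A_TOKENS)


def whitespace_count(text):
    # Stand-in tokenizer for illustration only, not the Qwen2.5-0.5B tokenizer.
    return len(text.split())


examples = [
    {"prompt": "word " * 25, "completion": "word " * 12},   # inside both ranges
    {"prompt": "too short", "completion": "word " * 12},    # question under 20 tokens
    {"prompt": "word " * 25, "completion": "word " * 250},  # answer over 200 tokens
]
kept = [ex for ex in examples if within_length(ex, whitespace_count)]
```

Passing the counter in as an argument keeps the filter independent of any particular tokenizer library, so a real subword tokenizer can be swapped in without changing the filter itself.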
### Stage 3: Deduplication
**Goal:** Remove duplicate questions, treating two questions as duplicates when their first 80 characters match.
**Method:**
- Hash the first 80 characters of each question
- Keep first occurrence, drop subsequent duplicates
- Run the filter in a single process so insertion order is preserved and the result is reproducible
**Output:** ~1.4M → ~1.35M unique examples
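The method above can be sketched as a single ordered pass over the data. The choice of SHA-256 and the function name `dedup_by_prefix` are assumptions for illustration; the source does not say which hash function is used.

```python
import hashlib

PREFIX_CHARS = 80  # number of leading question characters that are hashed


def dedup_by_prefix(examples):
    """Keep the first example for each distinct 80-character question prefix."""
    seen = set()
    unique = []
    for ex in examples:  # one sequential pass, so input order is preserved
        prefix = ex["prompt"][:PREFIX_CHARS]
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # a question with the same prefix was already kept
        seen.add(key)
        unique.append(ex)
    return unique
```

Because only the prefix is hashed, two questions that share their first 80 characters but differ later are also collapsed into one.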
### Stage 4: Stratified Sampling
**Goal:** Create a balanced 50K subset with representative coverage across question difficulty.
**Stratification:**
- Proxy difficulty by question token length (longer = typically harder/more complex)
- Bin questions into quartiles (4 bins) of the prompt token-length distribution
- Uniformly sample ~12.5K from each bin (50K / 4)
- If any bin has <12.5K examples, fill remaining slots from overflow pool
**Output:** 50K examples with balanced representation across question complexity.
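One way to implement the stratification described above is sketched here with the standard library only. The function name `stratified_sample`, its parameters, and the use of `statistics.quantiles` for the cut points are assumptions; the actual script may bin and sample differently.

```python
import random
import statistics


def stratified_sample(examples, lengths, total=50_000, n_bins=4, seed=42):
    """Sample `total` examples spread evenly across length quantile bins."""
    rng = random.Random(seed)
    cuts = statistics.quantiles(lengths, n=n_bins)  # n_bins - 1 cut points
    bins = [[] for _ in range(n_bins)]
    for ex, length in zip(examples, lengths):
        idx = sum(length > c for c in cuts)  # number of cut points below this length
        bins[idx].append(ex)

    per_bin = total // n_bins
    sample, overflow = [], []
    for members in bins:
        rng.shuffle(members)
        sample.extend(members[:per_bin])    # up to per_bin examples from this bin
        overflow.extend(members[per_bin:])  # leftovers go to the overflow pool

    # Top up from the overflow pool if any bin was smaller than per_bin.
    rng.shuffle(overflow)
    sample.extend(overflow[: total - len(sample)])
    return sample
```

With a fixed seed and a fixed input order, repeated calls return the same sample, which matches the reproducibility goal stated in the rationale.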
## Dataset Composition
The final 50K subset is pushed to the Hugging Face Hub at: **`w601sxs/simplecot_subset_50k`**
**Statistics:**
- Total examples: 50,000
- Clean (question, completion) pairs
- Question token length: typically 20–300 tokens
- Answer token length: typically 10–200 tokens
- Stratified across 4 difficulty bins
## Usage
Load the subset for training:
```python
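import os
from datasets import load_dataset
# Read the access token from the HF_TOKEN environment variable (None if unset).
ds = load_dataset("w601sxs/simplecot_subset_50k", split="train", token=os.environ.get("HF_TOKEN"))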
```
Or run the subsetting pipeline locally:
```bash
# Create 50K subset and save locally
python subset_data.py --size 50000
# Create 1K smoke test subset
python subset_data.py --size 1000
# Push to hub
python subset_data.py --size 50000 --push_to_hub w601sxs/simplecot_subset_50k
```
## Rationale
This pipeline ensures:
1. **Genericity:** Strips all dataset-specific formatting for reusable training data
2. **Quality:** Removes malformed and out-of-range examples
3. **Deduplication:** No data leakage from repeated examples
4. **Balance:** Representation across difficulty levels prevents bias toward easy or hard questions
5. **Reproducibility:** Deterministic (seed=42) for consistent results across runs
Provider:
w601sxs



