w601sxs/simplecot_subset_50k

Name: w601sxs/simplecot_subset_50k
Creator: w601sxs
Published: 2026-03-26 03:14:16
License: No description available

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/w601sxs/simplecot_subset_50k


Description:

# Creating the simpleCoT 50K Subset

## Overview

The 50K subset was created from the larger simpleCoT dataset (2.2M examples) through a 4-stage pipeline that normalizes, filters, deduplicates, and stratifies the data for balanced GRPO training.

## Pipeline

### Stage 1: Normalization & Format Extraction

**Goal:** Strip dataset-specific scaffolding to produce clean (question, answer) pairs.

The raw dataset wraps questions and answers in XML-like tags:

```
prompt: "Context: ...\nquestion: <ACTUAL QUESTION>\n..."
completion: "answer: <ACTUAL ANSWER>"
```

**Processing:**

- Extract the raw question text from the `question: <...>` wrapper using a regex
- Extract the raw answer text from the `answer: <...>` wrapper
- Drop malformed examples (missing tags or too short)
- Minimum length filter: question >10 chars, answer >5 chars

**Output:** ~2.1M → ~1.8M examples with columns `prompt` (question) and `completion` (answer)

### Stage 2: Length Filtering

**Goal:** Keep examples within a reasonable token length range to ensure quality and training stability.

**Filters:**

- Question: 20–300 tokens (using the Qwen2.5-0.5B tokenizer)
- Answer: 10–200 tokens

This removes overly short, trivial questions and extremely long answers that are hard to grade with ROUGE-L.

**Output:** ~1.8M → ~1.4M examples

### Stage 3: Deduplication

**Goal:** Remove duplicate questions.

**Method:**

- Hash the first 80 characters of each question, so questions that share the same 80-character prefix count as duplicates
- Keep the first occurrence and drop subsequent duplicates
- Use a single-process filter to preserve insertion order for reproducibility

**Output:** ~1.4M → ~1.35M unique examples

### Stage 4: Stratified Sampling

**Goal:** Create a balanced 50K subset with representative coverage across question difficulty.

**Stratification:**

- Use question token length as a proxy for difficulty (longer questions are typically harder or more complex)
- Bin questions into 4 quartiles based on the prompt length distribution
- Uniformly sample ~12.5K examples from each bin (50K / 4)
- If any bin has fewer than 12.5K examples, fill the remaining slots from the overflow pool

**Output:** 50K examples with balanced representation across question complexity.

## Dataset Composition

The final 50K subset is pushed to the HuggingFace Hub at **`w601sxs/simplecot_subset_50k`**.

**Statistics:**

- Total examples: 50,000
- Clean (question, completion) pairs
- Question token length: typically 20–300 tokens
- Answer token length: typically 10–200 tokens
- Stratified across 4 difficulty bins

## Usage

Load the subset for training:

```python
import os
from datasets import load_dataset
HF_TOKEN = os.environ.get("HF_TOKEN")  # read the access token from the environment; None falls back to the default login
ds = load_dataset("w601sxs/simplecot_subset_50k", split="train", token=HF_TOKEN)
```

Or run the subsetting pipeline locally:

```bash
# Create 50K subset and save locally
python subset_data.py --size 50000

# Create 1K smoke test subset
python subset_data.py --size 1000

# Push to hub
python subset_data.py --size 50000 --push_to_hub w601sxs/simplecot_subset_50k
```

## Rationale

This pipeline ensures:

1. **Genericity:** Strips all dataset-specific formatting for reusable training data
2. **Quality:** Removes malformed and out-of-range examples
3. **Deduplication:** No data leakage from repeated examples
4. **Balance:** Representation across difficulty levels prevents bias toward easy or hard questions
5. **Reproducibility:** Deterministic (seed=42) for consistent results across runs
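
For readers who want to see how the four pipeline stages fit together, here is a minimal pure-Python sketch. It is not the contents of `subset_data.py`, which this card does not show. The function names, the regular expressions, the use of MD5 as the hash, and the whitespace-based `count_tokens` (a stand-in for the Qwen2.5-0.5B tokenizer) are all assumptions made for illustration.

```python
import hashlib
import random
import re

# Assumed patterns for the "question: ..." / "answer: ..." wrappers.
QUESTION_RE = re.compile(r"question:[ \t]*(.+)")
ANSWER_RE = re.compile(r"answer:[ \t]*(.+)", re.DOTALL)


def normalize(example):
    """Stage 1: extract a clean (question, answer) pair, or None if malformed."""
    q = QUESTION_RE.search(example["prompt"])
    a = ANSWER_RE.search(example["completion"])
    if q is None or a is None:
        return None
    question, answer = q.group(1).strip(), a.group(1).strip()
    if len(question) <= 10 or len(answer) <= 5:
        return None
    return {"prompt": question, "completion": answer}


def count_tokens(text):
    """Stand-in token counter (whitespace split), NOT the Qwen2.5-0.5B tokenizer."""
    return len(text.split())


def length_ok(example, q_range=(20, 300), a_range=(10, 200)):
    """Stage 2: keep examples whose token counts fall inside both ranges."""
    q_len = count_tokens(example["prompt"])
    a_len = count_tokens(example["completion"])
    return q_range[0] <= q_len <= q_range[1] and a_range[0] <= a_len <= a_range[1]


def deduplicate(examples):
    """Stage 3: keep the first example for each hash of the 80-char question prefix."""
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.md5(ex["prompt"][:80].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def stratified_sample(examples, size, n_bins=4, seed=42):
    """Stage 4: sample evenly from length quantile bins, topping up from overflow."""
    rng = random.Random(seed)
    ranked = sorted(examples, key=lambda ex: count_tokens(ex["prompt"]))
    bounds = [round(i * len(ranked) / n_bins) for i in range(n_bins + 1)]
    bins = [ranked[bounds[i]:bounds[i + 1]] for i in range(n_bins)]
    per_bin = size // n_bins
    picked, overflow = [], []
    for bucket in bins:
        rng.shuffle(bucket)
        picked.extend(bucket[:per_bin])
        overflow.extend(bucket[per_bin:])
    # Fill any shortfall (small bins or a remainder) from the overflow pool.
    rng.shuffle(overflow)
    picked.extend(overflow[: size - len(picked)])
    return picked


def build_subset(raw_examples, size):
    """Chain the four stages in the order described in the Pipeline section."""
    cleaned = [ex for ex in map(normalize, raw_examples) if ex is not None]
    filtered = [ex for ex in cleaned if length_ok(ex)]
    return stratified_sample(deduplicate(filtered), size)
```

Calling `build_subset(raw_examples, 50000)` on a list of `{"prompt": ..., "completion": ...}` dicts would mirror the flow above. On the full 2.2M-example dataset, a `datasets.Dataset` with `map` and `filter` would likely be a better fit than in-memory lists.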


Provider:

w601sxs
