
w601sxs/simplecot_subset_50k

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/w601sxs/simplecot_subset_50k
Description:
# Creating the simpleCoT 50K Subset

## Overview

The 50K subset was created from the larger simpleCoT dataset (2.2M examples) through a 4-stage pipeline that normalizes, filters, deduplicates, and stratifies the data for balanced GRPO training.

## Pipeline

### Stage 1: Normalization & Format Extraction

**Goal:** Strip dataset-specific scaffolding to produce clean (question, answer) pairs.

The raw dataset wraps questions and answers in XML-like tags:

```
prompt: "Context: ...\nquestion: <ACTUAL QUESTION>\n..."
completion: "answer: <ACTUAL ANSWER>"
```

**Processing:**

- Extract the raw question text from the `question: <...>` wrapper using a regex
- Extract the raw answer text from the `answer: <...>` wrapper
- Drop malformed examples (missing tags or too short)
- Minimum length filter: question >10 chars, answer >5 chars

**Output:** ~2.1M → ~1.8M examples with columns `prompt` (question) and `completion` (answer)

### Stage 2: Length Filtering

**Goal:** Keep examples within a reasonable token length range to ensure quality and training stability.

**Filters:**

- Question: 20–300 tokens (using the Qwen2.5-0.5B tokenizer)
- Answer: 10–200 tokens

This removes trivially short questions and very long answers that are hard to grade with ROUGE-L.

**Output:** ~1.8M → ~1.4M examples

### Stage 3: Deduplication

**Goal:** Remove duplicate questions. Two questions count as duplicates when their first 80 characters match, so this is a prefix match rather than a full-text comparison.

**Method:**

- Hash the first 80 characters of each question
- Keep the first occurrence, drop subsequent duplicates
- Use a single-process filter to preserve insertion order for reproducibility

**Output:** ~1.4M → ~1.35M unique examples

### Stage 4: Stratified Sampling

**Goal:** Create a balanced 50K subset with representative coverage across question difficulty.

**Stratification:**

- Use question token length as a proxy for difficulty (longer questions are typically harder or more complex)
- Bin questions into 4 quartiles based on the prompt length distribution
- Uniformly sample ~12.5K examples from each bin (50K / 4)
- If any bin has fewer than 12.5K examples, fill the remaining slots from the overflow pool

**Output:** 50K examples with balanced representation across question complexity.

## Dataset Composition

The final 50K subset is pushed to the Hugging Face Hub at: **`w601sxs/simplecot_subset_50k`**

**Statistics:**

- Total examples: 50,000
- Clean (question, completion) pairs
- Question token length: typically 20–300 tokens
- Answer token length: typically 10–200 tokens
- Stratified across 4 difficulty bins

## Usage

Load the subset for training:

```python
import os
from datasets import load_dataset

# HF_TOKEN is read from the environment; it is None if the variable is unset
HF_TOKEN = os.environ.get("HF_TOKEN")
ds = load_dataset("w601sxs/simplecot_subset_50k", split="train", token=HF_TOKEN)
```

Or run the subsetting pipeline locally:

```bash
# Create 50K subset and save locally
python subset_data.py --size 50000

# Create 1K smoke test subset
python subset_data.py --size 1000

# Push to hub
python subset_data.py --size 50000 --push_to_hub w601sxs/simplecot_subset_50k
```

## Rationale

This pipeline ensures:

1. **Genericity:** Strips all dataset-specific formatting for reusable training data
2. **Quality:** Removes malformed and out-of-range examples
3. **Deduplication:** No data leakage from repeated examples
4. **Balance:** Representation across difficulty levels prevents bias toward easy or hard questions
5. **Reproducibility:** Deterministic (seed=42) for consistent results across runs
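
As an illustration of the Stage 4 stratification, the sketch below bins examples by question token length and draws an equal quota from each bin. It is a minimal sketch under stated assumptions, not the code in `subset_data.py`: the function name `stratified_sample`, its parameters, and the use of `statistics.quantiles` for the cut points are hypothetical choices. Only the quartile binning, the equal per-bin quota, the overflow fill, and the seed of 42 come from the description above.

```python
import random
from statistics import quantiles


def stratified_sample(lengths, total, n_bins=4, seed=42):
    """Return sorted indices of a length-stratified sample.

    lengths: question token length per example (the difficulty proxy).
    total:   number of examples to select across all bins.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection deterministic

    # Cut points that split the length distribution into n_bins quantile bins.
    cuts = quantiles(lengths, n=n_bins, method="inclusive")
    bins = [[] for _ in range(n_bins)]
    for idx, length in enumerate(lengths):
        # Bin index = number of cut points strictly below this length.
        bins[sum(length > c for c in cuts)].append(idx)

    per_bin = total // n_bins
    chosen, overflow = [], []
    for members in bins:
        rng.shuffle(members)
        chosen.extend(members[:per_bin])
        overflow.extend(members[per_bin:])  # unused examples form the overflow pool

    # Fill any shortfall (small bins or rounding) from the overflow pool.
    rng.shuffle(overflow)
    chosen.extend(overflow[: total - len(chosen)])
    return sorted(chosen)
```

Calling `stratified_sample(lengths, total=50_000)` would return the indices of the selected examples. Sorting the indices keeps the sample in the original dataset order, which is one way to stay consistent with the order-preserving deduplication in Stage 3.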

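
The Stage 1 extraction and Stage 3 deduplication can be sketched in a similar way. Everything named here is an assumption made for illustration: the regular expressions guess that the question occupies the rest of the `question:` line and that the answer is everything after `answer:`, the helpers `extract_pair` and `dedupe_by_prefix` are hypothetical, and SHA-256 stands in for whichever hash the original script uses. The length thresholds and the 80-character prefix are taken from the pipeline description.

```python
import hashlib
import re

# Assumed raw format: the question runs to the end of the "question:" line,
# and the answer is everything after "answer:". Adjust to the real data.
QUESTION_RE = re.compile(r"question:[ \t]*(.+)")
ANSWER_RE = re.compile(r"answer:[ \t]*(.+)", re.DOTALL)


def extract_pair(prompt, completion, min_q=10, min_a=5):
    """Stage 1: return (question, answer), or None for malformed examples."""
    q = QUESTION_RE.search(prompt)
    a = ANSWER_RE.search(completion)
    if q is None or a is None:
        return None  # wrapper missing
    question, answer = q.group(1).strip(), a.group(1).strip()
    if len(question) <= min_q or len(answer) <= min_a:
        return None  # too short: question must be >10 chars, answer >5 chars
    return question, answer


def dedupe_by_prefix(pairs, prefix_len=80):
    """Stage 3: keep the first pair seen for each distinct question prefix."""
    seen = set()
    unique = []
    for question, answer in pairs:
        prefix = question[:prefix_len].encode("utf-8")
        key = hashlib.sha256(prefix).hexdigest()
        if key not in seen:  # first occurrence wins; input order is preserved
            seen.add(key)
            unique.append((question, answer))
    return unique
```

Because `dedupe_by_prefix` walks the pairs in a single pass, the first occurrence of each prefix is always the one kept, which mirrors the single-process, order-preserving filter described in Stage 3.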
Provider:
w601sxs