
Brain2nd/NeuronSpark-Pretrain-v3

Source: Hugging Face · Updated 2026-04-23 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/Brain2nd/NeuronSpark-Pretrain-v3
Description:
---
language:
- en
- zh
license: mit
task_categories:
- text-generation
tags:
- pretraining
- bilingual
- neuromorphic
- snn
- reasoning
- cot
size_categories:
- 10B<n<100B
---

# NeuronSpark-Pretrain-v3

Bilingual pretraining corpus for NeuronSpark v3, a bio-inspired Spiking Neural Network language model with selective PLIF neurons and dynamic per-token compute budget (PonderNet-v3).

## Composition

| Metric | Value |
|---|---|
| Total documents | **18.2 M** |
| Estimated tokens | **~20 B** |
| Format | 37 Parquet shards (~1 GB each, zstd) |
| Schema | `text: string, source: string` |
| Languages | EN 55.6%, ZH 28.1%, code 16.3% |
| Deduplication | **All source sampling is weighted so each doc appears ≤ 1×** (no artificial repetition) |

### Category mix (sampled at WRITE time → any prefix ≈ target ratio)

| Category | Target | Actual |
|---|---|---|
| en_web (education / web) | 22% | 21.3% |
| zh_web (Chinese web) | 20% | 19.5% |
| synthetic (textbook / benchmark) | 13% | 16.7% |
| code (Python/JS/TS) | 15% | 16.3% |
| r1_distill (reasoning CoT) | 10% | 8.7% |
| math (algebra/calculus/olympiad) | 7% | 7.2% |
| narrative (novels / stories) | 10% | 8.2% |
| zh_pro (Chinese domain knowledge) | 3% | 2.1% |

## Sources (26 active)

### EN web (22%)

- `fineweb-edu-10BT` (FineWeb-Edu 10BT sample): 4.26 B tokens

### ZH web (20%)

- `skypile-150B` (SkyPile-150B subsample): 3.72 B
- `seq-monkey` (Mobvoi Seq-Monkey): 0.18 B

### Synthetic (13%)

- `cosmopedia` (HuggingFaceTB Cosmopedia): 3.33 B
- `benchmarks-pretrain` (MMLU / ARC / BoolQ / HellaSwag / PIQA / SIQA / Winogrande / OpenBookQA + C3 / CEval / ChID / CMMLU / CMRC2018, all splits merged into plain text): 0.009 B

### Code (15%)

- `github-code-py-js-ts` (codeparrot/github-code-clean, filtered to Python/JavaScript/TypeScript): 3.26 B

### R1-distill reasoning (10%)

- `mxode-reasoning-distil` (Chinese CoT): 0.30 B
- `qwq-longcot-130k`: 0.29 B
- `chinese-r1-110k` (Congliu Chinese-DeepSeek-R1-Distill): 0.22 B
- `bespoke-stratos-17k`: 0.09 B
- `zake-openscience-zh` (Chinese science reasoning): 0.07 B
- `open-thoughts-114k`: 0.76 B
- `qwq/s1K/LIMO` (various CoT sources, EN)

### Math (7%)

- `openwebmath`: 1.36 B
- `numinamath-cot` (MATH + olympiad): ~0.09 B
- `mxode-cmid-math` (Chinese physics/math solutions): 0.12 B
- `mxode-school-math` (Chinese K-12 math CoT): 0.05 B
- `almonster-mathinstruct-zh`: 0.004 B

### Narrative (10%), added in v3.3 to counter encyclopedia/wiki bias

- `gutenberg-en` (sedthh/gutenberg_english, public-domain EN lit): 0.87 B
- `webnovel-zh` (wdndev/webnovel-chinese, 6 of 10 shards): 0.69 B
- `tinystories` (roneneldan/TinyStories): 0.08 B

### ZH pro (3%)

- `zhihu-kol` (wangrui6/Zhihu-KOL): 0.27 B
- `medical-zh` (shibing624/medical pretrain subset): ~0.05 B
- `coig-cqia` (m-a-p/COIG-CQIA): ~0.05 B
- `belle-math` (BelleGroup/school_math_0.25M): ~0.03 B

## Processing

1. **Pass 1: per-source Bernoulli downsample to staging**
   - Stream each source (parquet / jsonl / HF-arrow)
   - Drop docs with `< 200 chars`
   - For R1-distill + chat-style sources: apply ChatML wrapping (`<|im_start|>role\n…<|im_end|>`) with `<think>reasoning</think>` wrapping for explicit CoT
   - Bernoulli `keep_prob` chosen so that the downsampled pool matches the target budget
2. **Pass 2: weighted-draw interleave (stop-anywhere-safe)**
   - For each draw, pick source `i` with probability `target_w[cat_i] × rows_i / Σ_{j ∈ cat} rows_j × avg_tok_j`
   - Guarantees category-token-share ≈ target at ANY prefix of shards
   - No oversampling: each doc ≤ 1× copy in final output (wraps = 1 for all sources)
3. **Shard layout**
   - 500 000 docs per shard → ~37 shards of ~1 GB each
   - Shuffled across sources during interleave

## Language-level breakdown

| Language | Tokens | Share |
|---|---|---|
| English | 11.12 B | 55.6% |
| Chinese | 5.61 B | 28.1% |
| Code | 3.26 B | 16.3% |

## Intended use

Pretraining the **NeuronSpark v3** SNN language model:

- 1 B parameters, bio-inspired PLIF neurons with selective firing
- PonderNet-v3 dynamic K per token (Gumbel-ST + forced exploration)
- Muon optimizer on matrix params + AuxAdam on everything else
- DeepSpeed ZeRO-0 for Muon compatibility

## Build scripts

See `scripts/v3_data/` in the NeuronSpark-V1 repository:

- `manifest.py`: source declarations + target weights
- `build_pretrain_mix.py`: pass-1 downsample + pass-2 interleave
- `build_benchmark_pretrain.py`: convert HF eval benchmarks to plain text
- `download_extra.py`: fetch external sources (github-code, gutenberg, etc.)

## Notes

- **No deduplication** is performed beyond per-source filters. Upstream sources are already deduplicated.
- **Benchmark contamination**: Per user directive `没有区分训练集测试集的直接混` ("mix directly, without separating train and test sets"), the benchmark eval subset (all splits merged) IS included in pretrain. Downstream eval scores should be interpreted with this in mind.
- `lambada_openai` is permanently excluded (project policy).
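
The Pass 1 steps listed under Processing can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `build_pretrain_mix.py`: the function names, the `user`/`assistant` role names, the placement of `<think>` inside the assistant turn, and the `target / available` rule for `keep_prob` are all hypothetical. Only the 200-character threshold and the ChatML and `<think>` markers come from the card.

```python
import random

MIN_CHARS = 200  # threshold stated in the card


def keep_prob(target_tokens: float, available_tokens: float) -> float:
    """Assumed rule: keep just enough docs to hit the token budget, capped at 1."""
    if available_tokens <= 0:
        return 0.0
    return min(1.0, target_tokens / available_tokens)


def wrap_chatml(question: str, reasoning: str, answer: str) -> str:
    """Wrap one CoT sample in ChatML; role names and layout are assumptions."""
    return (
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n<think>{reasoning}</think>\n{answer}<|im_end|>"
    )


def downsample(docs, prob: float, seed: int = 0):
    """Stream docs, drop short ones, then keep each survivor with probability `prob`."""
    rng = random.Random(seed)
    for text in docs:
        if len(text) < MIN_CHARS:
            continue  # length filter runs before the Bernoulli draw
        if rng.random() < prob:
            yield text
```

Because each document is seen once and kept at most once, this step cannot introduce repeats, which is in line with the "each doc ≤ 1×" property claimed for the final output.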

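
The Pass 2 draw rule can be illustrated with a small sketch. It reads the card's formula as "weight of source `i` = `target_w[cat_i] × rows_i / Σ_{j ∈ cat}(rows_j × avg_tok_j)`" and then normalizes the weights to sum to 1; that grouping and the normalization are assumptions, as are all function names and the toy numbers. Under this reading the expected token share of each category equals its target weight, which is the property the card describes.

```python
import random
from collections import defaultdict


def draw_weights(sources, target_w):
    """sources: name -> (category, rows, avg_tok). Returns normalized draw probabilities."""
    cat_tokens = defaultdict(float)
    for cat, rows, avg_tok in sources.values():
        cat_tokens[cat] += rows * avg_tok
    raw = {
        name: target_w[cat] * rows / cat_tokens[cat]
        for name, (cat, rows, avg_tok) in sources.items()
    }
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def expected_token_share(sources, probs):
    """Expected fraction of tokens per category for a single draw."""
    tok = defaultdict(float)
    for name, (cat, _rows, avg_tok) in sources.items():
        tok[cat] += probs[name] * avg_tok
    total = sum(tok.values())
    return {cat: t / total for cat, t in tok.items()}


def interleave(pools, probs, seed=0):
    """Yield (source, doc) pairs; every doc is emitted at most once."""
    rng = random.Random(seed)
    iters = {name: iter(docs) for name, docs in pools.items()}
    live = list(iters)
    while live:
        name = rng.choices(live, weights=[probs[n] for n in live])[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            live.remove(name)  # exhausted source: never wrap around


# Toy numbers, purely illustrative.
SOURCES = {
    "web_a": ("web", 1000, 800.0),
    "web_b": ("web", 500, 400.0),
    "code_a": ("code", 300, 1500.0),
}
TARGET_W = {"web": 0.7, "code": 0.3}
PROBS = draw_weights(SOURCES, TARGET_W)
SHARE = expected_token_share(SOURCES, PROBS)
```

Within a category the draw probability is proportional to `rows_i`, so sources in the same category tend to run out at about the same time; that is what keeps the mix stable at any prefix without repeating documents.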
Provider:
Brain2nd