Brain2nd/NeuronSpark-Pretrain-v3
Source: Hugging Face. Updated 2026-04-23. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Brain2nd/NeuronSpark-Pretrain-v3
Resource overview:
---
language:
- en
- zh
license: mit
task_categories:
- text-generation
tags:
- pretraining
- bilingual
- neuromorphic
- snn
- reasoning
- cot
size_categories:
- 10B<n<100B
---
# NeuronSpark-Pretrain-v3
Bilingual pretraining corpus for NeuronSpark v3, a bio-inspired Spiking Neural
Network language model with selective PLIF neurons and dynamic per-token compute
budget (PonderNet-v3).
## Composition
| Metric | Value |
|---|---|
| Total documents | **18.2 M** |
| Estimated tokens | **~20 B** |
| Format | 37 Parquet shards (~1 GB each, zstd) |
| Schema | `text: string, source: string` |
| Languages | EN 55.6%, ZH 28.1%, code 16.3% |
| Repetition | **Sampling weights are set so each doc appears ≤ 1×** (no oversampling; no extra deduplication pass is run, see Notes) |
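
As a rough illustration of the two-column schema, the sketch below streams rows and tallies characters per `source`. It assumes the Hugging Face `datasets` library is installed and that the split is named `train`, which the card does not confirm. The helper names `stream_rows` and `tally_sources` are hypothetical, and the `source` values in the demo are assumed to match the names in the Sources list.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator


def stream_rows(repo_id: str = "Brain2nd/NeuronSpark-Pretrain-v3",
                split: str = "train") -> Iterator[Dict[str, str]]:
    """Yield rows lazily. The split name "train" is an assumption."""
    # Imported lazily so the pure helper below works without `datasets`.
    from datasets import load_dataset
    yield from load_dataset(repo_id, split=split, streaming=True)


def tally_sources(rows: Iterable[Dict[str, str]], limit: int) -> Counter:
    """Count characters per `source` over the first `limit` rows."""
    chars: Counter = Counter()
    for row in islice(rows, limit):
        chars[row["source"]] += len(row["text"])
    return chars


if __name__ == "__main__":
    # Synthetic rows with the documented schema; no download needed.
    demo = [
        {"text": "def f(): pass", "source": "github-code-py-js-ts"},
        {"text": "Plants turn light into sugar.", "source": "cosmopedia"},
        {"text": "x = 1", "source": "github-code-py-js-ts"},
    ]
    print(tally_sources(demo, limit=3))
```

Passing `stream_rows()` instead of `demo` would run the same tally over a prefix of the real shards, at the cost of a network download.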
### Category mix (sampled at WRITE time → any prefix ≈ target ratio)
| Category | Target | Actual |
|---|---|---|
| en_web (education / web) | 22% | 21.3% |
| zh_web (Chinese web) | 20% | 19.5% |
| synthetic (textbook / benchmark) | 13% | 16.7% |
| code (Python/JS/TS) | 15% | 16.3% |
| r1_distill (reasoning CoT) | 10% | 8.7% |
| math (algebra/calculus/olympiad) | 7% | 7.2% |
| narrative (novels / stories) | 10% | 8.2% |
| zh_pro (Chinese domain knowledge) | 3% | 2.1% |
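
The gap between the Target and Actual columns can be checked with a few lines of arithmetic. The sketch below copies the table values into a dictionary (the names `MIX` and `deviations` are illustrative, not part of the build scripts) and reports the deviation per category in percentage points. With these numbers, `synthetic` overshoots its target by 3.7 points and `narrative` undershoots by 1.8.

```python
# (target %, actual %) per category, copied from the table above.
MIX = {
    "en_web": (22.0, 21.3),
    "zh_web": (20.0, 19.5),
    "synthetic": (13.0, 16.7),
    "code": (15.0, 16.3),
    "r1_distill": (10.0, 8.7),
    "math": (7.0, 7.2),
    "narrative": (10.0, 8.2),
    "zh_pro": (3.0, 2.1),
}


def deviations(mix):
    """Return {category: actual - target} in percentage points, rounded to 0.1."""
    return {name: round(actual - target, 1) for name, (target, actual) in mix.items()}


if __name__ == "__main__":
    # Largest absolute deviation first.
    for name, delta in sorted(deviations(MIX).items(), key=lambda kv: -abs(kv[1])):
        print(f"{name:<11} {delta:+.1f} pp")
```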
## Sources (26 active)
### EN web (22%)
- `fineweb-edu-10BT` (FineWeb-Edu 10BT sample) — 4.26 B tokens
### ZH web (20%)
- `skypile-150B` (SkyPile-150B subsample) — 3.72 B
- `seq-monkey` (Mobvoi Seq-Monkey) — 0.18 B
### Synthetic (13%)
- `cosmopedia` (HuggingFaceTB Cosmopedia) — 3.33 B
- `benchmarks-pretrain` (MMLU / ARC / BoolQ / HellaSwag / PIQA / SIQA / Winogrande / OpenBookQA + C3 / CEval / ChID / CMMLU / CMRC2018, all splits merged into plain text) — 0.009 B
### Code (15%)
- `github-code-py-js-ts` (codeparrot/github-code-clean, filtered to Python/JavaScript/TypeScript) — 3.26 B
### R1-distill reasoning (10%)
- `mxode-reasoning-distil` (Chinese CoT) — 0.30 B
- `qwq-longcot-130k` — 0.29 B
- `chinese-r1-110k` (Congliu Chinese-DeepSeek-R1-Distill) — 0.22 B
- `bespoke-stratos-17k` — 0.09 B
- `zake-openscience-zh` (Chinese science reasoning) — 0.07 B
- `open-thoughts-114k` — 0.76 B
- `qwq/s1K/LIMO` (various CoT sources, EN)
### Math (7%)
- `openwebmath` — 1.36 B
- `numinamath-cot` (MATH + olympiad) — ~0.09 B
- `mxode-cmid-math` (Chinese physics/math solutions) — 0.12 B
- `mxode-school-math` (Chinese K-12 math CoT) — 0.05 B
- `almonster-mathinstruct-zh` — 0.004 B
### Narrative (10%) — added v3.3 to counter encyclopedia/wiki bias
- `gutenberg-en` (sedthh/gutenberg_english, public-domain EN lit) — 0.87 B
- `webnovel-zh` (wdndev/webnovel-chinese, 6 of 10 shards) — 0.69 B
- `tinystories` (roneneldan/TinyStories) — 0.08 B
### ZH pro (3%)
- `zhihu-kol` (wangrui6/Zhihu-KOL) — 0.27 B
- `medical-zh` (shibing624/medical pretrain subset) — ~0.05 B
- `coig-cqia` (m-a-p/COIG-CQIA) — ~0.05 B
- `belle-math` (BelleGroup/school_math_0.25M) — ~0.03 B
## Processing
1. **Pass 1: per-source Bernoulli downsample to staging**
   - Stream each source (parquet / jsonl / HF-arrow)
   - Drop docs shorter than 200 chars
   - For R1-distill + chat-style sources: apply ChatML wrapping (`<|im_start|>role\n…<|im_end|>`), with explicit CoT enclosed in `<think>reasoning</think>`
   - Bernoulli `keep_prob` is chosen per source so the downsampled pool matches the target budget
2. **Pass 2: weighted-draw interleave (stop-anywhere-safe)**
   - For each draw, pick source `i` with probability proportional to `target_w[cat_i] × rows_i / Σ_{j ∈ cat_i} (rows_j × avg_tok_j)`
   - A draw from source `i` contributes `avg_tok_i` tokens on average, so the expected token share of each category equals its `target_w`; category-token-share ≈ target at ANY prefix of shards
   - No oversampling: each doc ≤ 1× copy in final output (wraps = 1 for all sources)
3. **Shard layout**
   - 500 000 docs per shard → ~37 shards of ~1 GB each
   - Docs from different sources are shuffled together during the interleave
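
The two passes above can be sketched in plain Python. This is a minimal illustration, not the code in `build_pretrain_mix.py`: the function names and in-memory lists are hypothetical, the length filter is read as "drop docs under 200 chars", the sum in the weight formula is read as covering the product `rows_j × avg_tok_j`, and exhausted sources are assumed to simply drop out of the draw.

```python
import random
from typing import Dict, Iterator, List, Optional, Tuple


def wrap_chatml(role: str, content: str, reasoning: Optional[str] = None) -> str:
    """Wrap one turn in ChatML; explicit CoT goes inside <think>...</think>."""
    body = f"<think>{reasoning}</think>{content}" if reasoning else content
    return f"<|im_start|>{role}\n{body}<|im_end|>"


def pass1_downsample(docs: List[str], keep_prob: float, rng: random.Random,
                     min_chars: int = 200) -> List[str]:
    """Drop docs under min_chars, then keep each survivor with prob keep_prob."""
    return [d for d in docs if len(d) >= min_chars and rng.random() < keep_prob]


def source_weights(pools: Dict[str, List[str]], category: Dict[str, str],
                   avg_tok: Dict[str, float],
                   target_w: Dict[str, float]) -> Dict[str, float]:
    """Unnormalised weight: target_w[cat] * rows_i / sum_j(rows_j * avg_tok_j)."""
    cat_tokens: Dict[str, float] = {}
    for name, docs in pools.items():
        cat = category[name]
        cat_tokens[cat] = cat_tokens.get(cat, 0.0) + len(docs) * avg_tok[name]
    return {name: target_w[category[name]] * len(docs) / cat_tokens[category[name]]
            for name, docs in pools.items() if docs}


def pass2_interleave(pools: Dict[str, List[str]], category: Dict[str, str],
                     avg_tok: Dict[str, float], target_w: Dict[str, float],
                     rng: random.Random) -> Iterator[Tuple[str, str]]:
    """Yield (text, source) pairs; each doc is emitted at most once."""
    weights = source_weights(pools, category, avg_tok, target_w)
    remaining = {name: list(docs) for name, docs in pools.items() if docs}
    for docs in remaining.values():
        rng.shuffle(docs)
    while remaining:
        names = list(remaining)
        pick = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        yield remaining[pick].pop(), pick
        if not remaining[pick]:
            del remaining[pick]  # exhausted: no wrap-around, so no repeats
```

Multiplying each weight by `avg_tok_i` and summing within a category gives back `target_w[cat]`, which is why the expected token share tracks the target at any stopping point. Under the stated layout, the shard for the `n`-th emitted doc would be `n // 500_000`.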
## Language-level breakdown
| Language | Tokens | Share |
|---|---|---|
| English | 11.12 B | 55.6% |
| Chinese | 5.61 B | 28.1% |
| Code | 3.26 B | 16.3% |
## Intended use
Pretraining the **NeuronSpark v3** SNN language model:
- 1 B parameters, bio-inspired PLIF neurons with selective firing
- PonderNet-v3 dynamic K per token (Gumbel-ST + forced exploration)
- Muon optimizer on matrix params + AuxAdam on everything else
- DeepSpeed ZeRO-0 for Muon compatibility
## Build scripts
See `scripts/v3_data/` in the NeuronSpark-V1 repository:
- `manifest.py` — source declarations + target weights
- `build_pretrain_mix.py` — pass-1 downsample + pass-2 interleave
- `build_benchmark_pretrain.py` — convert HF eval benchmarks to plain text
- `download_extra.py` — fetch external sources (github-code, gutenberg, etc.)
## Notes
- **No deduplication** is performed beyond per-source filters. Upstream sources
are already deduplicated.
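- **Benchmark contamination**: Per the user directive to mix everything in directly,
without separating train and test sets, the benchmark eval subset (all splits merged)
IS included in pretrain. Downstream scores on these benchmarks may be inflated by
this leakage and should not be read as held-out results.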
- `lambada_openai` is permanently excluded (project policy).
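
Readers who want a benchmark-free variant could filter on the `source` column. The sketch below assumes benchmark rows carry the value `benchmarks-pretrain` in `source`, matching the name in the Sources list; the card does not state this explicitly, and the helper name `drop_benchmarks` is hypothetical. It does not catch benchmark text that may also occur inside the web or synthetic sources.

```python
from typing import Dict, FrozenSet, Iterable, Iterator

# Assumption: benchmark rows are tagged with this `source` value.
BENCHMARK_SOURCE = "benchmarks-pretrain"


def drop_benchmarks(
    rows: Iterable[Dict[str, str]],
    excluded: FrozenSet[str] = frozenset({BENCHMARK_SOURCE}),
) -> Iterator[Dict[str, str]]:
    """Yield only rows whose `source` is not in `excluded`."""
    for row in rows:
        if row["source"] not in excluded:
            yield row


if __name__ == "__main__":
    demo = [
        {"text": "Question: ... Answer: B", "source": "benchmarks-pretrain"},
        {"text": "A web page about tides.", "source": "fineweb-edu-10BT"},
    ]
    for row in drop_benchmarks(demo):
        print(row["source"])
```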
Provider:
Brain2nd



