Brain2nd/NeuronSpark-V1

Name: Brain2nd/NeuronSpark-V1
Creator: Brain2nd
Published: 2026-03-18 03:21:13
License: No description available

Hugging Face · Updated 2026-03-18 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Brain2nd/NeuronSpark-V1


Description:

---
language:
- en
- zh
license: apache-2.0
task_categories:
- text-generation
tags:
- pretraining
- bilingual
- snn
- neuromorphic
size_categories:
- 10B<n<100B
---

# NeuronSpark-V1 Pretraining Dataset

Bilingual (English + Chinese) pretraining corpus for NeuronSpark, a bio-inspired Spiking Neural Network language model.

## Dataset Summary

| Metric | Value |
|---|---|
| Total documents | 17,174,734 |
| Estimated tokens | ~14.5B |
| Languages | English (55%), Chinese (42%), Bilingual Math (3%) |
| Format | Parquet (35 shards, ~39 GB) |
| Columns | `text` (string), `source` (string) |

## Sources & Composition

| Source | Documents | Ratio | Est. Tokens | Description |
|---|---|---|---|---|
| [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 6,810,451 | 39.7% | ~7B | High-quality English educational web text |
| [SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B) | 7,173,310 | 41.8% | ~4.5B | High-quality Chinese web text |
| [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | 2,313,934 | 13.5% | ~1.5B | Synthetic English textbooks & articles |
| [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) | 792,380 | 4.6% | ~1.5B | English mathematical web text |
| [BelleGroup/school_math_0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M) | 84,659 | 0.5% | ~17M | Chinese math problem-solving |

## Processing

- **Minimum length filter**: Documents shorter than 200 characters are removed
- **Sampling**: Each source is sampled to target token count using reservoir sampling
- **Shuffling**: Documents are shuffled within each output shard
- **No deduplication** across sources (each source is pre-deduplicated upstream)

## Intended Use

Pre-training a 0.6B-parameter bilingual SNN language model (NeuronSpark). The dataset is designed to provide:

- General knowledge from web text (English + Chinese)
- Mathematical reasoning from dedicated math corpora
- Structured knowledge from synthetic textbooks

## Train Tokenizer

Train a 64K-vocab BPE tokenizer on this dataset:

```bash
pip install tokenizers transformers pandas tqdm

# Clone this dataset
# git clone https://huggingface.co/datasets/Brain2nd/NeuronSpark-V1
# cd NeuronSpark-V1

python scripts/train_tokenizer.py \
    --data_dir data/pretrain_mix \
    --save_dir tokenizer \
    --vocab_size 64000 \
    --sample_docs 500000
```

The script samples documents from the parquet shards, then trains a ByteLevel BPE tokenizer. Adjust `--sample_docs` based on available RAM:

| sample_docs | Corpus size | RAM needed | Quality |
|---|---|---|---|
| 200,000 | ~0.8 GB | ~8 GB | Good |
| 500,000 | ~2 GB | ~16 GB | Better |
| 2,000,000 | ~8 GB | ~64 GB | Best |

Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2), `<|im_start|>` (3), `<|im_end|>` (4), `<|pad|>` (5)

## License

This dataset is a curated mixture of publicly available datasets. Please refer to the individual source licenses:

- FineWeb-Edu: ODC-BY 1.0
- SkyPile-150B: Skywork Community License
- Cosmopedia: Apache 2.0
- OpenWebMath: ODC-BY 1.0
- BelleGroup/school_math: GPL-3.0
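
The steps listed under Processing (length filter, reservoir sampling, in-shard shuffling) can be illustrated with a small standard-library sketch. This is not the dataset's actual pipeline, which is not shown in the card. The function names `length_filter`, `reservoir_sample`, and `build_shard` are hypothetical, and the sketch samples a fixed number of documents `k` rather than a target token count, which is a simplifying assumption.

```python
import random
from typing import Iterable, Iterator, List

MIN_CHARS = 200  # threshold stated in the Processing section


def length_filter(docs: Iterable[str], min_chars: int = MIN_CHARS) -> Iterator[str]:
    """Yield only documents with at least `min_chars` characters."""
    for doc in docs:
        if len(doc) >= min_chars:
            yield doc


def reservoir_sample(docs: Iterable[str], k: int, rng: random.Random) -> List[str]:
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    reservoir: List[str] = []
    for i, doc in enumerate(docs):
        if i < k:
            # Fill the reservoir with the first k documents.
            reservoir.append(doc)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = doc
    return reservoir


def build_shard(docs: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Hypothetical helper: filter, sample k documents, then shuffle the shard."""
    rng = random.Random(seed)
    sample = reservoir_sample(length_filter(docs), k, rng)
    rng.shuffle(sample)
    return sample
```

Reservoir sampling needs only one pass and `k` documents of memory, which suits upstream corpora that are too large to load at once. A token-budgeted variant would need a per-document token estimate, which the card does not specify.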

Provider:

Brain2nd