
Brain2nd/NeuronSpark-V1

Source: Hugging Face. Updated 2026-03-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Brain2nd/NeuronSpark-V1
Resource description:
---
language:
- en
- zh
license: apache-2.0
task_categories:
- text-generation
tags:
- pretraining
- bilingual
- snn
- neuromorphic
size_categories:
- 10B<n<100B
---

# NeuronSpark-V1 Pretraining Dataset

Bilingual (English + Chinese) pretraining corpus for NeuronSpark, a bio-inspired Spiking Neural Network language model.

## Dataset Summary

| Metric | Value |
|---|---|
| Total documents | 17,174,734 |
| Estimated tokens | ~14.5B |
| Languages | English (55%), Chinese (42%), Bilingual Math (3%) |
| Format | Parquet (35 shards, ~39 GB) |
| Columns | `text` (string), `source` (string) |

## Sources & Composition

| Source | Documents | Ratio | Est. Tokens | Description |
|---|---|---|---|---|
| [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 6,810,451 | 39.7% | ~7B | High-quality English educational web text |
| [SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B) | 7,173,310 | 41.8% | ~4.5B | High-quality Chinese web text |
| [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | 2,313,934 | 13.5% | ~1.5B | Synthetic English textbooks & articles |
| [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) | 792,380 | 4.6% | ~1.5B | English mathematical web text |
| [BelleGroup/school_math_0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M) | 84,659 | 0.5% | ~17M | Chinese math problem-solving |

## Processing

- **Minimum length filter**: Documents shorter than 200 characters are removed
- **Sampling**: Each source is sampled to target token count using reservoir sampling
- **Shuffling**: Documents are shuffled within each output shard
- **No deduplication** across sources (each source is pre-deduplicated upstream)

## Intended Use

Pre-training a 0.6B-parameter bilingual SNN language model (NeuronSpark). The dataset is designed to provide:

- General knowledge from web text (English + Chinese)
- Mathematical reasoning from dedicated math corpora
- Structured knowledge from synthetic textbooks

## Train Tokenizer

Train a 64K-vocab BPE tokenizer on this dataset:

```bash
pip install tokenizers transformers pandas tqdm

# Clone this dataset
# git clone https://huggingface.co/datasets/Brain2nd/NeuronSpark-V1
# cd NeuronSpark-V1

python scripts/train_tokenizer.py \
    --data_dir data/pretrain_mix \
    --save_dir tokenizer \
    --vocab_size 64000 \
    --sample_docs 500000
```

The script samples documents from the parquet shards, then trains a ByteLevel BPE tokenizer. Adjust `--sample_docs` based on available RAM:

| sample_docs | Corpus size | RAM needed | Quality |
|---|---|---|---|
| 200,000 | ~0.8 GB | ~8 GB | Good |
| 500,000 | ~2 GB | ~16 GB | Better |
| 2,000,000 | ~8 GB | ~64 GB | Best |

Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2), `<|im_start|>` (3), `<|im_end|>` (4), `<|pad|>` (5)

## License

This dataset is a curated mixture of publicly available datasets. Please refer to the individual source licenses:

- FineWeb-Edu: ODC-BY 1.0
- SkyPile-150B: Skywork Community License
- Cosmopedia: Apache 2.0
- OpenWebMath: ODC-BY 1.0
- BelleGroup/school_math: GPL-3.0
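The steps listed under Processing (length filter, reservoir sampling, per-shard shuffle) can be illustrated with a small sketch. This is not the authors' pipeline: the function names `filter_short`, `reservoir_sample`, and `build_shard` are hypothetical, and the sketch samples a fixed number of documents `k` with the classic Algorithm R, whereas the card says sampling targets a token count and does not describe how that is done.

```python
import random
from typing import Iterable, Iterator, List

MIN_CHARS = 200  # threshold stated in the Processing section


def filter_short(docs: Iterable[str], min_chars: int = MIN_CHARS) -> Iterator[str]:
    """Yield only documents with at least `min_chars` characters."""
    for doc in docs:
        if len(doc) >= min_chars:
            yield doc


def reservoir_sample(docs: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k documents from a stream (Algorithm R).

    Assumption: a fixed document count k stands in for the token budget
    mentioned in the dataset card.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, doc in enumerate(docs):
        if i < k:
            # Fill the reservoir with the first k documents.
            reservoir.append(doc)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = doc
    return reservoir


def build_shard(docs: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Filter, sample, then shuffle documents for one output shard."""
    sampled = reservoir_sample(filter_short(docs), k, seed)
    random.Random(seed).shuffle(sampled)
    return sampled
```

Reservoir sampling needs only one pass and memory proportional to `k`, which is why it suits sources too large to load at once; for example, `build_shard(stream_of_texts, k=1000)` returns at most 1,000 shuffled documents of 200 or more characters.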
Provider:
Brain2nd