
Jaikirat/fineweb10B_sp8192

Hugging Face | Updated 2026-04-19 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Jaikirat/fineweb10B_sp8192
Resource description:
---
pretty_name: fineweb10B_sp8192
task_categories:
- text-generation
language:
- en
license: odc-by
size_categories:
- 10M<n<100M
configs:
- config_name: default
  data_files:
  - split: train
    path: datasets/fineweb10B_sp8192/fineweb_train_*.bin
  - split: validation
    path: datasets/fineweb10B_sp8192/fineweb_val_*.bin
---

# Jaikirat/fineweb10B_sp8192

This repo contains a SentencePiece BPE `SP8192` retokenization of the Parameter Golf FineWeb docs cache, exported into `.bin` shards for training.

## Contents

- `tokenizers/fineweb_8192_bpe.model`
- `tokenizers/fineweb_8192_bpe.vocab`
- `datasets/fineweb10B_sp8192/fineweb_train_*.bin`
- `datasets/fineweb10B_sp8192/fineweb_val_*.bin`
- `manifest.json`

## Source

- upstream docs repo: [`willdepueoai/parameter-golf`](https://huggingface.co/datasets/willdepueoai/parameter-golf)
- upstream subdir: `datasets`
- upstream docs file: `datasets/docs_selected.jsonl`
- upstream source export root recorded in sidecar: `/root/exports/fineweb_50Bsub100B_50keval_v0`
- docs sha256: `84386dfa7b339a5d4831d5273c4a2028b78b60670d3a235633a8520545d19bc7`
- selection seed: `1337`
- upstream license: [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/)
- upstream terms referenced by the source dataset: [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use)

## Tokenizer

- kind: `sentencepiece_bpe`
- vocab size: `8192`
- bos id: `1`
- eos id: `2`

## Export Stats

- total docs seen: `15368808`
- validation docs: `50000`
- train docs: `15318808`
- total files: `130`
- validation files: `1`
- train files: `129`
- total tokens: `12748384091`
- validation tokens: `40541268`
- train tokens: `12707842823`
- shard size: `100000000`

## Notes

- Validation is stored as one or more `fineweb_val_*.bin` shards.
- Training is stored as `fineweb_train_*.bin` shards with a trailing partial shard when needed.
- Tokens are stored as `uint16` with a 256-int header matching the local exporter format.
- This dataset is a retokenized derivative of the upstream docs cache, not an original corpus release.
- Redistribution and use should preserve upstream attribution requirements from ODC-By.
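
## Example: reading a shard

The Notes above describe each `.bin` shard as a 256-int header followed by `uint16` tokens. A minimal sketch of a reader follows, under stated assumptions: the header integers are taken to be little-endian 32-bit values (the card says only "256-int"), the tokens are taken to be little-endian, and the meaning of individual header fields is not documented here, so the header is returned uninterpreted. The function name `read_shard` and the synthetic demo file are hypothetical and not part of this repo.

```python
import struct
import sys
import tempfile
from array import array
from pathlib import Path

# ASSUMPTION: 256 little-endian int32 values (1024 bytes).
# The card states "256-int header" without giving the integer width.
HEADER_FORMAT = "<256i"
HEADER_BYTES = struct.calcsize(HEADER_FORMAT)


def read_shard(path):
    """Return (header, tokens) for one shard under the assumed layout."""
    raw = Path(path).read_bytes()
    if len(raw) < HEADER_BYTES:
        raise ValueError("file is smaller than the assumed header")
    header = struct.unpack(HEADER_FORMAT, raw[:HEADER_BYTES])
    payload = raw[HEADER_BYTES:]
    if len(payload) % 2 != 0:
        raise ValueError("payload is not a whole number of uint16 tokens")
    tokens = array("H")  # unsigned 16-bit integers
    tokens.frombytes(payload)
    if sys.byteorder == "big":
        tokens.byteswap()  # ASSUMPTION: tokens on disk are little-endian
    return header, tokens


# Demo on a synthetic shard (not real data): a zeroed header, then four
# tokens using the bos id 1 and eos id 2 listed in the Tokenizer section.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_shard.bin"
    demo_path.write_bytes(
        struct.pack(HEADER_FORMAT, *([0] * 256))
        + struct.pack("<4H", 1, 17, 8191, 2)
    )
    header, tokens = read_shard(demo_path)

print(len(header), list(tokens))
```

Because the vocab size is `8192`, every token id fits comfortably in `uint16`. For full-size shards, a memory-mapped read may be preferable to loading the whole file at once.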
Provider:
Jaikirat