Jaikirat/fineweb10B_sp8192
Hugging Face. Updated 2026-04-19. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Jaikirat/fineweb10B_sp8192
Resource description:
---
pretty_name: fineweb10B_sp8192
task_categories:
- text-generation
language:
- en
license: odc-by
size_categories:
- 10M<n<100M
configs:
- config_name: default
  data_files:
  - split: train
    path: datasets/fineweb10B_sp8192/fineweb_train_*.bin
  - split: validation
    path: datasets/fineweb10B_sp8192/fineweb_val_*.bin
---
# Jaikirat/fineweb10B_sp8192
This repo contains a SentencePiece BPE `SP8192` retokenization of the Parameter Golf FineWeb docs cache, exported into `.bin` shards for training.
## Contents
- `tokenizers/fineweb_8192_bpe.model`
- `tokenizers/fineweb_8192_bpe.vocab`
- `datasets/fineweb10B_sp8192/fineweb_train_*.bin`
- `datasets/fineweb10B_sp8192/fineweb_val_*.bin`
- `manifest.json`
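Because the full set of training shards is large, it can be useful to fetch only the tokenizer, the manifest, and the validation shard(s) first. The following is a minimal sketch, assuming the third-party `huggingface_hub` package is installed; the helper name `fetch_small_subset` and the `SMALL_SUBSET` list are illustrative and not part of this repo. The paths are copied from the list above.

```python
REPO_ID = "Jaikirat/fineweb10B_sp8192"

# Paths and glob patterns copied from the Contents list; the training shards
# are deliberately left out to keep the download small.
SMALL_SUBSET = [
    "tokenizers/fineweb_8192_bpe.model",
    "tokenizers/fineweb_8192_bpe.vocab",
    "datasets/fineweb10B_sp8192/fineweb_val_*.bin",
    "manifest.json",
]


def fetch_small_subset(local_dir=None):
    """Download only the files matching SMALL_SUBSET (hypothetical helper).

    Returns the local directory that holds the downloaded snapshot.
    """
    # Imported lazily so the module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=SMALL_SUBSET,
        local_dir=local_dir,
    )
```

Adding `datasets/fineweb10B_sp8192/fineweb_train_*.bin` to the pattern list would pull the training shards as well. When working through a mirror, setting the `HF_ENDPOINT` environment variable before the call is the usual way to redirect `huggingface_hub`, though that depends on the mirror.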
## Source
- upstream docs repo: [`willdepueoai/parameter-golf`](https://huggingface.co/datasets/willdepueoai/parameter-golf)
- upstream subdir: `datasets`
- upstream docs file: `datasets/docs_selected.jsonl`
- upstream source export root recorded in sidecar: `/root/exports/fineweb_50Bsub100B_50keval_v0`
- docs sha256: `84386dfa7b339a5d4831d5273c4a2028b78b60670d3a235633a8520545d19bc7`
- selection seed: `1337`
- upstream license: [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/)
- upstream terms referenced by the source dataset: [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use)
## Tokenizer
- kind: `sentencepiece_bpe`
- vocab size: `8192`
- bos id: `1`
- eos id: `2`
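The values above can be checked against the shipped model file. This is a sketch that assumes the third-party `sentencepiece` package is installed and that `tokenizers/fineweb_8192_bpe.model` has been downloaded locally; `describe_tokenizer` and `mismatches` are hypothetical helper names.

```python
# Values taken from the Tokenizer section of this card.
EXPECTED = {"vocab_size": 8192, "bos_id": 1, "eos_id": 2}


def describe_tokenizer(model_path):
    """Load a SentencePiece model and report the fields listed on the card."""
    # Imported lazily so the module can be loaded without the dependency.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return {
        "vocab_size": sp.get_piece_size(),
        "bos_id": sp.bos_id(),
        "eos_id": sp.eos_id(),
    }


def mismatches(info, expected=EXPECTED):
    """Return {field: (found, expected)} for every field that differs."""
    return {
        key: (info.get(key), want)
        for key, want in expected.items()
        if info.get(key) != want
    }
```

With the model file on disk, `mismatches(describe_tokenizer("tokenizers/fineweb_8192_bpe.model"))` should return an empty dict if the file matches the card.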
## Export Stats
- total docs seen: `15368808`
- validation docs: `50000`
- train docs: `15318808`
- total files: `130`
- validation files: `1`
- train files: `129`
- total tokens: `12748384091`
- validation tokens: `40541268`
- train tokens: `12707842823`
- shard size: `100000000`
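As a quick sanity check, the train and validation counts above should add up to the totals. The sketch below copies the numbers from this section into a dict and verifies that; the function names are illustrative.

```python
# Numbers copied from the Export Stats list.
STATS = {
    "total_docs": 15368808, "val_docs": 50000, "train_docs": 15318808,
    "total_files": 130, "val_files": 1, "train_files": 129,
    "total_tokens": 12748384091, "val_tokens": 40541268,
    "train_tokens": 12707842823,
}


def failed_checks(stats):
    """Return the kinds (docs/files/tokens) whose train + val != total."""
    failed = []
    for kind in ("docs", "files", "tokens"):
        if stats[f"train_{kind}"] + stats[f"val_{kind}"] != stats[f"total_{kind}"]:
            failed.append(kind)
    return failed


def mean_tokens_per_doc(stats, split="total"):
    """Average number of tokens per document for 'total', 'train', or 'val'."""
    return stats[f"{split}_tokens"] / stats[f"{split}_docs"]


if __name__ == "__main__":
    print("failed checks:", failed_checks(STATS))
    for split in ("total", "train", "val"):
        print(split, round(mean_tokens_per_doc(STATS, split), 1))
```

The per-document averages are derived values, not figures reported by the exporter.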
## Notes
- Validation is stored as one or more `fineweb_val_*.bin` shards.
- Training is stored as `fineweb_train_*.bin` shards with a trailing partial shard when needed.
- Tokens are stored as `uint16` with a 256-int header matching the local exporter format.
- This dataset is a retokenized derivative of the upstream docs cache, not an original corpus release.
- Redistribution and use should preserve upstream attribution requirements from ODC-By.
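The shard layout described in the notes can be sketched with the standard library alone. The card does not state the width of the header integers or what they contain, so the code below makes two loud assumptions: the header is 256 little-endian `int32` values, and the token count sits at index 2 (`ASSUMED_COUNT_INDEX`). Both should be checked against the exporter or `manifest.json` before relying on them. The example writes a tiny synthetic shard and reads it back, so it runs without downloading anything.

```python
import os
import struct
import sys
import tempfile
from array import array

HEADER_INTS = 256                # stated on the card
HEADER_BYTES = HEADER_INTS * 4   # ASSUMPTION: each header int is a 4-byte int32
ASSUMED_COUNT_INDEX = 2          # ASSUMPTION: header slot holding the token count


def write_shard(path, tokens):
    """Write a synthetic shard: 256-int header followed by uint16 tokens."""
    header = [0] * HEADER_INTS
    header[ASSUMED_COUNT_INDEX] = len(tokens)
    data = array("H", tokens)    # raises OverflowError for ids above 65535
    assert data.itemsize == 2
    if sys.byteorder != "little":
        data.byteswap()          # store tokens as little-endian on disk
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{HEADER_INTS}i", *header))
        f.write(data.tobytes())


def read_shard(path):
    """Return (header, tokens) from a shard in the assumed layout."""
    with open(path, "rb") as f:
        header = struct.unpack(f"<{HEADER_INTS}i", f.read(HEADER_BYTES))
        data = array("H")
        data.frombytes(f.read())
    if sys.byteorder != "little":
        data.byteswap()
    if header[ASSUMED_COUNT_INDEX] != len(data):
        raise ValueError("header token count does not match payload length")
    return header, data


# Round trip on a synthetic shard; ids stay below the 8192-entry vocabulary.
sample = [1, 17, 8191, 2]
with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "synthetic.bin")
    write_shard(shard_path, sample)
    header, tokens = read_shard(shard_path)
roundtrip = list(tokens)
```

An 8192-entry vocabulary fits comfortably in `uint16`, which tops out at 65535. For real shards of up to 100000000 tokens, a memory-mapped reader may be preferable to loading the whole payload into memory.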
Provider:
Jaikirat



