
reasoning-degeneration-dev/prepretraining-web-v1

Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/reasoning-degeneration-dev/prepretraining-web-v1
Description:
---
license: mit
tags:
- prepretraining
- web-data
- fineweb
- tokenized
---

# prepretraining-web-v1

FineWeb (unfiltered) tokenized web data for pre-pretraining experiments. Split into 1B-token .npy chunks. GPT-NeoX tokenizer, uint32 format.

## Dataset Info

- **Rows**: 10
- **Columns**: 3

## Columns

| Column | Type | Description |
|--------|------|-------------|
| filename | Value('string') | Name of the .npy file |
| token_count | Value('int64') | Number of uint32 token IDs in the file |
| size_bytes | Value('int64') | File size in bytes |

## Generation Parameters

```json
{
  "script_name": "data/upload_data.py",
  "model": "N/A (pre-tokenized training data, not model outputs)",
  "description": "FineWeb (unfiltered) tokenized web data for pre-pretraining experiments. Split into 1B-token .npy chunks. GPT-NeoX tokenizer, uint32 format.",
  "tokenizer": "allenai/gpt-neox-olmo-dolma-v1_5",
  "format": "Pre-tokenized uint32 .npy memmap arrays for OLMo-core",
  "source": "HuggingFaceFW/fineweb",
  "subset": "sample-10BT",
  "vocab_size": 50280,
  "eos_token_id": 50279,
  "total_train_tokens": 8000000140,
  "held_out_tokens": 1000000,
  "chunk_size": 1000000000,
  "num_chunks": 9,
  "seed": 42,
  "hyperparameters": {},
  "input_datasets": []
}
```

## Experiment Documentation

For complete experiment details, see [https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining](https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining)

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("reasoning-degeneration-dev/prepretraining-web-v1", split="train")
print(f"Loaded {len(dataset)} rows")
```

---

*This dataset is tracked in [reasoning-degeneration-dev/PROJECT-MANIFEST](https://huggingface.co/datasets/reasoning-degeneration-dev/PROJECT-MANIFEST)*
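
The card describes the chunks as uint32 memmap arrays but does not show how to read one. The following is a minimal sketch, not the authors' loader. It assumes a chunk is a headerless, flat file of little-endian uint32 token IDs (the card does not confirm whether the files carry a NumPy `.npy` header; if they do, `np.load(path, mmap_mode="r")` would be the call to try instead). The helper names `open_chunk` and `split_documents` and the file `demo_chunk.npy` are hypothetical, and the data is a tiny synthetic stand-in rather than a real chunk. Only `eos_token_id = 50279` is taken from the card.

```python
import os
import tempfile

import numpy as np

EOS_TOKEN_ID = 50279  # "eos_token_id" from the card's generation parameters


def open_chunk(path):
    """Memory-map a file of uint32 token IDs (assumed headerless layout)."""
    return np.memmap(path, dtype=np.uint32, mode="r")


def split_documents(tokens, eos_id=EOS_TOKEN_ID):
    """Split a 1-D token array into pieces that each end right after an EOS token."""
    ends = np.flatnonzero(tokens == eos_id) + 1
    return [doc for doc in np.split(tokens, ends) if doc.size > 0]


# Synthetic stand-in for a real chunk: three short "documents".
demo_tokens = np.array(
    [11, 12, EOS_TOKEN_ID, 21, EOS_TOKEN_ID, 31, 32, 33], dtype=np.uint32
)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_chunk.npy")  # hypothetical file name
    demo_tokens.tofile(demo_path)  # writes raw bytes, no .npy header

    size_bytes = os.path.getsize(demo_path)
    chunk = open_chunk(demo_path)
    token_count = int(chunk.shape[0])
    doc_lengths = [int(d.size) for d in split_documents(np.asarray(chunk))]
    del chunk  # release the mapping before the temp directory is removed

print(size_bytes, token_count, doc_lengths)
```

Under this headerless assumption, `size_bytes` equals `token_count * 4`, so a full 1,000,000,000-token chunk would occupy 4,000,000,000 bytes. Comparing the `token_count` and `size_bytes` columns of the dataset is one way to check whether the assumption holds for the real files.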
Provided by:
reasoning-degeneration-dev