reasoning-degeneration-dev/prepretraining-web-v1
Source: Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/reasoning-degeneration-dev/prepretraining-web-v1
Resource description:
---
license: mit
tags:
- prepretraining
- web-data
- fineweb
- tokenized
---
# prepretraining-web-v1
Tokenized web data from FineWeb (unfiltered) for pre-pretraining experiments, split into 1B-token `.npy` chunks. Tokens are encoded with the GPT-NeoX tokenizer and stored as uint32.
## Dataset Info
- **Rows**: 10
- **Columns**: 3
## Columns
| Column | Type | Description |
|--------|------|-------------|
| filename | Value('string') | Name of the .npy file |
| token_count | Value('int64') | Number of uint32 token IDs in the file |
| size_bytes | Value('int64') | File size in bytes |
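Because each token ID is a uint32 (4 bytes), `size_bytes` should be close to `token_count * 4`, with any difference coming from a file header. The sketch below shows that check on a tiny synthetic array. The helper `header_overhead` is a hypothetical name, and whether the real chunks carry a standard `.npy` header or are raw dumps is an assumption the card does not settle, so both cases are shown.

```python
import os
import tempfile

import numpy as np

BYTES_PER_TOKEN = np.dtype(np.uint32).itemsize  # 4 bytes per uint32 token ID


def header_overhead(token_count: int, size_bytes: int) -> int:
    """Bytes in the file that are not token payload (0 for a raw dump)."""
    return size_bytes - token_count * BYTES_PER_TOKEN


# Demo on a tiny synthetic array, not on the real chunks.
tokens = np.arange(1000, dtype=np.uint32)
with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "demo.npy")
    raw_path = os.path.join(tmp, "demo.bin")
    np.save(npy_path, tokens)  # standard .npy: header + payload
    tokens.tofile(raw_path)    # raw dump: payload only
    npy_overhead = header_overhead(tokens.size, os.path.getsize(npy_path))
    raw_overhead = header_overhead(tokens.size, os.path.getsize(raw_path))

print(f".npy header overhead: {npy_overhead} bytes, raw overhead: {raw_overhead} bytes")
```

Applied to a row of this dataset, a small positive overhead would suggest a standard `.npy` header, zero would suggest a raw dump, and a negative value would mean the row's metadata is inconsistent.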
## Generation Parameters
```json
{
"script_name": "data/upload_data.py",
"model": "N/A (pre-tokenized training data, not model outputs)",
"description": "FineWeb (unfiltered) tokenized web data for pre-pretraining experiments. Split into 1B-token .npy chunks. GPT-NeoX tokenizer, uint32 format.",
"tokenizer": "allenai/gpt-neox-olmo-dolma-v1_5",
"format": "Pre-tokenized uint32 .npy memmap arrays for OLMo-core",
"source": "HuggingFaceFW/fineweb",
"subset": "sample-10BT",
"vocab_size": 50280,
"eos_token_id": 50279,
"total_train_tokens": 8000000140,
"held_out_tokens": 1000000,
"chunk_size": 1000000000,
"num_chunks": 9,
"seed": 42,
"hyperparameters": {},
"input_datasets": []
}
```
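The relationship between `total_train_tokens`, `chunk_size`, and `num_chunks` can be checked with a few lines of arithmetic. This is a sketch that assumes chunks are filled in order and the final chunk holds whatever is left over; the card does not describe the splitting logic in `data/upload_data.py`.

```python
# Values copied from the generation parameters above.
total_train_tokens = 8_000_000_140
chunk_size = 1_000_000_000
bytes_per_token = 4  # uint32

# Assumption: chunks are filled in order and the last one holds the remainder.
full_chunks, remainder = divmod(total_train_tokens, chunk_size)
num_chunks = full_chunks + (1 if remainder else 0)

# Payload size of one full chunk, ignoring any file header.
full_chunk_bytes = chunk_size * bytes_per_token
full_chunk_gib = full_chunk_bytes / 2**30

print(f"{full_chunks} full chunks + {remainder} leftover tokens -> {num_chunks} chunks")
print(f"one full chunk is about {full_chunk_gib:.2f} GiB of token payload")
```

Under that assumption the result is 8 full chunks plus a 140-token remainder, which agrees with the stated `num_chunks` of 9.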
## Experiment Documentation
For complete experiment details, see [https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining](https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining)
## Usage
```python
from datasets import load_dataset

# Loads the metadata table (one row per .npy file), not the token arrays themselves.
dataset = load_dataset("reasoning-degeneration-dev/prepretraining-web-v1", split="train")
print(f"Loaded {len(dataset)} rows")
```
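Once a chunk file is available locally, it can be memory-mapped rather than read fully into RAM, which matters for files holding up to 1B tokens. The sketch below runs on a tiny synthetic file. The helper `open_chunk` and its `raw` flag are hypothetical, and it is an assumption that `eos_token_id` (50279) marks document boundaries in the token stream.

```python
import os
import tempfile

import numpy as np

EOS_TOKEN_ID = 50279  # from the generation parameters


def open_chunk(path: str, raw: bool = False) -> np.ndarray:
    """Memory-map a uint32 token file without loading it into RAM."""
    if raw:
        # Assumed layout: headerless dump of uint32 token IDs.
        return np.memmap(path, dtype=np.uint32, mode="r")
    # Standard .npy file with a header.
    return np.load(path, mmap_mode="r")


# Demo on a synthetic file with two EOS-terminated "documents".
demo_tokens = np.array([5, 6, EOS_TOKEN_ID, 7, EOS_TOKEN_ID], dtype=np.uint32)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_chunk.npy")
    np.save(path, demo_tokens)
    chunk = open_chunk(path)
    chunk_dtype = chunk.dtype
    n_tokens = int(chunk.shape[0])
    n_docs = int(np.count_nonzero(chunk == EOS_TOKEN_ID))
    del chunk  # release the mapping before the temp directory is removed

print(f"{n_tokens} tokens, {n_docs} EOS markers, dtype={chunk_dtype}")
```

For the real data, the path would come from downloading a file named in the `filename` column, for example with `huggingface_hub.hf_hub_download(..., repo_type="dataset")`, assuming the `.npy` files are stored in this repository under those names.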
---
*This dataset is tracked in [reasoning-degeneration-dev/PROJECT-MANIFEST](https://huggingface.co/datasets/reasoning-degeneration-dev/PROJECT-MANIFEST)*
Provider:
reasoning-degeneration-dev



