reasoning-degeneration-dev/prepretraining-gold-v1
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/reasoning-degeneration-dev/prepretraining-gold-v1
Description:
---
license: mit
tags:
- prepretraining
- gold-data
- tokenized
---
# prepretraining-gold-v1
Gold (high-quality) tokenized data for pre-pretraining experiments. Composition: 30% Cosmopedia, 25% FineMath-4+, 20% Tulu 3, 15% Python-Edu, 10% peS2o. Tokenized with the GPT-NeoX tokenizer and stored as uint32 .npy files.
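Since the files hold uint32 token IDs, a shard can be inspected without reading it fully into memory. The following is a minimal sketch, not the project's own loader. The helper name `open_tokens` and the example path are hypothetical. The card does not say whether the files carry the standard `.npy` header or are raw headerless buffers, so the sketch checks the magic bytes and handles both cases.

```python
import numpy as np

# Every standard .npy file starts with these six bytes.
NPY_MAGIC = b"\x93NUMPY"

def open_tokens(path):
    """Open a token file as a read-only, memory-mapped uint32 array."""
    with open(path, "rb") as f:
        has_header = f.read(len(NPY_MAGIC)) == NPY_MAGIC
    if has_header:
        # Standard .npy: np.load parses the header and memory-maps the data.
        return np.load(path, mmap_mode="r")
    # Assumption: anything else is treated as a raw, headerless uint32 buffer.
    return np.memmap(path, dtype=np.uint32, mode="r")

# Example (hypothetical local path):
# tokens = open_tokens("part-000.npy")
# print(tokens.shape, tokens[:10])
```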
## Dataset Info
- **Rows**: 12
- **Columns**: 3
## Columns
| Column | Type | Description |
|--------|------|-------------|
| filename | Value('string') | Name of the .npy file |
| token_count | Value('int64') | Number of uint32 token IDs in the file |
| size_bytes | Value('int64') | File size in bytes |
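Because each token ID is a uint32 (4 bytes), `token_count` and `size_bytes` can be cross-checked against each other. The sketch below uses a made-up row for illustration, not real values from this dataset. Any bytes beyond `4 * token_count` would be file overhead such as a `.npy` header, and a headerless buffer would give 0.

```python
BYTES_PER_TOKEN = 4  # uint32

def header_bytes(token_count, size_bytes):
    """Bytes in the file beyond the raw token payload."""
    return size_bytes - token_count * BYTES_PER_TOKEN

# Hypothetical row values, for illustration only.
row = {"filename": "example.npy", "token_count": 1000, "size_bytes": 4128}
print(header_bytes(row["token_count"], row["size_bytes"]))  # 128
```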
## Generation Parameters
```json
{
  "script_name": "data/upload_data.py",
  "model": "N/A (pre-tokenized training data, not model outputs)",
  "description": "Gold (high-quality) tokenized data for pre-pretraining experiments. Composition: 30% Cosmopedia, 25% FineMath-4+, 20% Tulu 3, 15% Python-Edu, 10% peS2o. GPT-NeoX tokenizer, uint32 .npy format.",
  "tokenizer": "allenai/gpt-neox-olmo-dolma-v1_5",
  "format": "Pre-tokenized uint32 .npy memmap arrays for OLMo-core",
  "vocab_size": 50280,
  "eos_token_id": 50279,
  "total_target_tokens": 500000000,
  "composition": {
    "cosmopedia": 0.3,
    "finemath": 0.25,
    "tulu3": 0.2,
    "fineweb_edu": 0.15,
    "cosmopedia_stanford": 0.1
  },
  "seed": 42,
  "combined_train_tokens": 500000000,
  "combined_held_out_tokens": 5000000,
  "combined_train_file": "data/gold/gold_combined.npy",
  "combined_held_out_file": "data/gold/held_out_gold.npy",
  "hyperparameters": {},
  "input_datasets": []
}
```
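The `composition` fractions can be turned into per-source token budgets by multiplying each by `total_target_tokens`. This is a sketch of that arithmetic only. The card does not describe how `data/upload_data.py` actually allocates tokens, and the helper name `token_budgets` is hypothetical.

```python
TOTAL_TARGET_TOKENS = 500_000_000
COMPOSITION = {
    "cosmopedia": 0.3,
    "finemath": 0.25,
    "tulu3": 0.2,
    "fineweb_edu": 0.15,
    "cosmopedia_stanford": 0.1,
}

def token_budgets(total, composition):
    """Split a total token budget across sources by their fractions."""
    if abs(sum(composition.values()) - 1.0) > 1e-9:
        raise ValueError("composition fractions must sum to 1")
    # round() absorbs floating-point error in the products.
    return {name: round(total * frac) for name, frac in composition.items()}

budgets = token_budgets(TOTAL_TARGET_TOKENS, COMPOSITION)
for name, n in budgets.items():
    print(f"{name}: {n:,}")
```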
## Experiment Documentation
For complete experiment details, see [https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining](https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining)
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("reasoning-degeneration-dev/prepretraining-gold-v1", split="train")
print(f"Loaded {len(dataset)} rows")
```
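Each loaded row describes one `.npy` file, so the `token_count` column can be summed to get the total number of tokens listed. A minimal sketch follows; the helper name `total_tokens` is hypothetical, and it accepts any iterable of row dictionaries.

```python
def total_tokens(rows):
    """Sum the token_count field over an iterable of row dicts."""
    return sum(int(row["token_count"]) for row in rows)

# With the dataset loaded as shown above (requires network access):
# print(f"{total_tokens(dataset):,} tokens across {len(dataset)} files")
```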
---
*This dataset is tracked in [reasoning-degeneration-dev/PROJECT-MANIFEST](https://huggingface.co/datasets/reasoning-degeneration-dev/PROJECT-MANIFEST)*
Provider:
reasoning-degeneration-dev



