reasoning-degeneration-dev/prepretraining-sft-v1
收藏Hugging Face2026-03-22 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/reasoning-degeneration-dev/prepretraining-sft-v1
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
tags:
- prepretraining
- sft
- tulu3
---
# prepretraining-sft-v1
Tulu 3 SFT subset (50K examples) for pre-pretraining eval. Applied identically to all experimental conditions.
## Dataset Info
- **Rows**: 50000
- **Columns**: 1
## Columns
| Column | Type | Description |
|--------|------|-------------|
| messages | List({'content': Value('string'), 'role': Value('string')}) | Conversation turns [{role, content}] in Tulu 3 format |
## Generation Parameters
```json
{
"script_name": "data/upload_data.py",
"model": "N/A (SFT training data, not model outputs)",
"description": "Tulu 3 SFT subset (50K examples) for pre-pretraining eval. Applied identically to all experimental conditions.",
"source": "allenai/tulu-3-sft-mixture",
"n_examples": 50000,
"total_turns": 118682,
"avg_turns_per_example": 2.37,
"seed": 42,
"output_file": "data/sft/tulu3_50k.jsonl",
"hyperparameters": {},
"input_datasets": []
}
```
## Experiment Documentation
For complete experiment details, see [https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining](https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining)
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("reasoning-degeneration-dev/prepretraining-sft-v1", split="train")
print(f"Loaded {len(dataset)} rows")
```
---
*This dataset is tracked in [reasoning-degeneration-dev/PROJECT-MANIFEST](https://huggingface.co/datasets/reasoning-degeneration-dev/PROJECT-MANIFEST)*
提供机构:
reasoning-degeneration-dev



