reasoning-degeneration-dev/prepretraining-training-samples-v1
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/reasoning-degeneration-dev/prepretraining-training-samples-v1
Resource summary:
---
license: mit
tags:
- prepretraining
- data-inspection
- training-samples
---
# prepretraining-training-samples-v1
First 10 training sequences (4096 tokens each) for each of the 4 conditions, decoded back to human-readable text. Shows exactly what the model sees during training. Use this to verify data quality, ordering, and condition differentiation.
## Dataset Info
- **Rows**: 40
- **Columns**: 8
## Columns
| Column | Type | Description |
|--------|------|-------------|
| condition | Value('string') | Data scheduling condition: baseline, front-load, constant-mix, or anneal |
| condition_description | Value('string') | What this condition feeds the model during its first phase |
| sequence_index | Value('int64') | Index of this sequence within the condition (0 = very first thing the model sees) |
| text | Value('string') | Decoded text of the 4096-token training sequence (human-readable) |
| n_tokens | Value('int64') | Number of tokens in this sequence (always 4096) |
| n_chars | Value('int64') | Character count of the decoded text |
| n_documents | Value('int64') | Number of documents packed into this sequence (separated by EOS tokens) |
| source_files | Value('string') | Source .npy file names (first 3) |
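The column descriptions above imply a few invariants that can be checked mechanically. The following is a minimal sketch, not part of the original tooling: `check_rows` is a hypothetical helper, and it assumes that `n_chars` equals Python's `len(text)` and that each condition holds sequence indices 0 through 9. It accepts any iterable of row dictionaries.

```python
from collections import defaultdict

# Values taken from the dataset card; the checks themselves are assumptions.
EXPECTED_CONDITIONS = {"baseline", "front-load", "constant-mix", "anneal"}
SEQUENCE_LENGTH = 4096
SEQUENCES_PER_CONDITION = 10


def check_rows(rows):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    indices = defaultdict(list)
    for i, row in enumerate(rows):
        cond = row["condition"]
        if cond not in EXPECTED_CONDITIONS:
            problems.append(f"row {i}: unexpected condition {cond!r}")
        if row["n_tokens"] != SEQUENCE_LENGTH:
            problems.append(f"row {i}: n_tokens is {row['n_tokens']}")
        # Assumption: n_chars counts Unicode code points, as len() does.
        if row["n_chars"] != len(row["text"]):
            problems.append(f"row {i}: n_chars does not match len(text)")
        indices[cond].append(row["sequence_index"])
    # Each condition should contribute indices 0..9 exactly once.
    expected = list(range(SEQUENCES_PER_CONDITION))
    for cond in sorted(EXPECTED_CONDITIONS):
        if sorted(indices.get(cond, [])) != expected:
            problems.append(f"condition {cond!r}: sequence_index values are not 0-9")
    return problems
```

If any check fails, the returned messages name the offending row or condition, which may help when verifying data quality and ordering.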
## Generation Parameters
```json
{
  "script_name": "analysis/sample_training_batches.py",
  "model": "N/A (data inspection, not model output)",
  "description": "First 10 training sequences (4096 tokens each) for each of the 4 conditions, decoded back to human-readable text. Shows exactly what the model sees during training. Use this to verify data quality, ordering, and condition differentiation.",
  "hyperparameters": {
    "sequence_length": 4096,
    "n_sequences_per_condition": 10,
    "tokenizer": "allenai/gpt-neox-olmo-dolma-v1_5"
  },
  "input_datasets": [
    "reasoning-degeneration-dev/prepretraining-gold-v1",
    "reasoning-degeneration-dev/prepretraining-web-v1"
  ],
  "experiment_id": "prepretraining",
  "artifact_type": "input_data",
  "visualizer_type": "table",
  "artifact_group": "data-inspection"
}
```
## Experiment Documentation
For complete experiment details, see [https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining](https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining)
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("reasoning-degeneration-dev/prepretraining-training-samples-v1", split="train")
print(f"Loaded {len(dataset)} rows")
```
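To compare conditions once the rows are loaded, a small per-condition summary can be useful. This is a sketch under stated assumptions: `summarize_by_condition` and its `preview_chars` parameter are hypothetical names, and the function only relies on the `condition`, `sequence_index`, `n_documents`, and `text` columns listed in the card.

```python
from collections import defaultdict


def summarize_by_condition(rows, preview_chars=80):
    """Group rows by condition and report simple statistics for each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["condition"]].append(row)

    summary = {}
    for cond, items in grouped.items():
        # Order by sequence_index so the first item is what the model sees first.
        ordered = sorted(items, key=lambda r: r["sequence_index"])
        summary[cond] = {
            "n_sequences": len(ordered),
            "mean_documents": sum(r["n_documents"] for r in ordered) / len(ordered),
            "first_text_preview": ordered[0]["text"][:preview_chars],
        }
    return summary
```

Calling `summarize_by_condition(dataset)` on the loaded split should work as written, since iterating a `datasets.Dataset` yields one dictionary per row.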
---
*This dataset is tracked in [reasoning-degeneration-dev/PROJECT-MANIFEST](https://huggingface.co/datasets/reasoning-degeneration-dev/PROJECT-MANIFEST)*
Provider:
reasoning-degeneration-dev



