
reasoning-degeneration-dev/prepretraining-training-samples-v1

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/reasoning-degeneration-dev/prepretraining-training-samples-v1
Description:
---
license: mit
tags:
- prepretraining
- data-inspection
- training-samples
---

# prepretraining-training-samples-v1

First 10 training sequences (4096 tokens each) for each of the 4 conditions, decoded back to human-readable text. Shows exactly what the model sees during training. Use this to verify data quality, ordering, and condition differentiation.

## Dataset Info

- **Rows**: 40
- **Columns**: 8

## Columns

| Column | Type | Description |
|--------|------|-------------|
| condition | Value('string') | Data scheduling condition: baseline, front-load, constant-mix, or anneal |
| condition_description | Value('string') | What this condition feeds the model during its first phase |
| sequence_index | Value('int64') | Index of this sequence within the condition (0 = very first thing the model sees) |
| text | Value('string') | Decoded text of the 4096-token training sequence (human-readable) |
| n_tokens | Value('int64') | Number of tokens in this sequence (always 4096) |
| n_chars | Value('int64') | Character count of the decoded text |
| n_documents | Value('int64') | Number of documents packed into this sequence (separated by EOS tokens) |
| source_files | Value('string') | Source .npy file names (first 3) |

## Generation Parameters

```json
{
  "script_name": "analysis/sample_training_batches.py",
  "model": "N/A (data inspection, not model output)",
  "description": "First 10 training sequences (4096 tokens each) for each of the 4 conditions, decoded back to human-readable text. Shows exactly what the model sees during training. Use this to verify data quality, ordering, and condition differentiation.",
  "hyperparameters": {
    "sequence_length": 4096,
    "n_sequences_per_condition": 10,
    "tokenizer": "allenai/gpt-neox-olmo-dolma-v1_5"
  },
  "input_datasets": [
    "reasoning-degeneration-dev/prepretraining-gold-v1",
    "reasoning-degeneration-dev/prepretraining-web-v1"
  ],
  "experiment_id": "prepretraining",
  "artifact_type": "input_data",
  "visualizer_type": "table",
  "artifact_group": "data-inspection"
}
```

## Experiment Documentation

For complete experiment details, see [https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining](https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining)

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("reasoning-degeneration-dev/prepretraining-training-samples-v1", split="train")
print(f"Loaded {len(dataset)} rows")
```

---

*This dataset is tracked in [reasoning-degeneration-dev/PROJECT-MANIFEST](https://huggingface.co/datasets/reasoning-degeneration-dev/PROJECT-MANIFEST)*
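
## Example: checking the documented invariants

The card states a few properties that can be checked mechanically: four conditions, 10 sequences per condition, and `n_tokens` always equal to 4096. The sketch below is one possible way to verify them and is not part of the original card. The helper names `check_samples` and `make_synthetic_rows` are hypothetical, and two checks rest on assumptions rather than on anything the card guarantees: that `sequence_index` runs from 0 to 9 within each condition, and that `n_chars` equals Python's `len(text)`. The synthetic rows exist only so the example runs without network access.

```python
from collections import defaultdict

# Values taken from the dataset card.
EXPECTED_CONDITIONS = {"baseline", "front-load", "constant-mix", "anneal"}
SEQUENCE_LENGTH = 4096
N_SEQUENCES_PER_CONDITION = 10


def check_samples(rows):
    """Return a list of problems found in `rows` (an empty list means none)."""
    problems = []
    indices_by_condition = defaultdict(list)

    for row in rows:
        label = f"{row['condition']}[{row['sequence_index']}]"
        indices_by_condition[row["condition"]].append(row["sequence_index"])

        # The card says n_tokens is always 4096.
        if row["n_tokens"] != SEQUENCE_LENGTH:
            problems.append(f"{label}: n_tokens is {row['n_tokens']}")

        # Assumption: n_chars was computed as len(text) on the decoded string.
        if row["n_chars"] != len(row["text"]):
            problems.append(f"{label}: n_chars does not match len(text)")

    seen = set(indices_by_condition)
    for condition in sorted(EXPECTED_CONDITIONS - seen):
        problems.append(f"missing condition: {condition}")
    for condition in sorted(seen - EXPECTED_CONDITIONS):
        problems.append(f"unexpected condition: {condition}")

    # Assumption: indices within a condition are exactly 0..9.
    expected_indices = list(range(N_SEQUENCES_PER_CONDITION))
    for condition, indices in sorted(indices_by_condition.items()):
        if sorted(indices) != expected_indices:
            problems.append(f"{condition}: unexpected sequence_index values")

    return problems


def make_synthetic_rows():
    """Build stand-in rows with the card's column names (not real data)."""
    rows = []
    for condition in sorted(EXPECTED_CONDITIONS):
        for index in range(N_SEQUENCES_PER_CONDITION):
            text = f"placeholder text for {condition} sequence {index}"
            rows.append({
                "condition": condition,
                "sequence_index": index,
                "text": text,
                "n_tokens": SEQUENCE_LENGTH,
                "n_chars": len(text),
            })
    return rows


if __name__ == "__main__":
    for problem in check_samples(make_synthetic_rows()) or ["no problems found"]:
        print(problem)
```

To run the same checks on the real data, pass the object returned by `load_dataset(..., split="train")` to `check_samples`; iterating over it yields one dictionary per row, keyed by column name.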
Provided by:
reasoning-degeneration-dev