
reasoning-degeneration-dev/prepretraining-gold-v1

Source: Hugging Face. Updated: 2026-03-22. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/reasoning-degeneration-dev/prepretraining-gold-v1
Description:
---
license: mit
tags:
- prepretraining
- gold-data
- tokenized
---

# prepretraining-gold-v1

Gold (high-quality) tokenized data for pre-pretraining experiments. Composition: 30% Cosmopedia, 25% FineMath-4+, 20% Tulu 3, 15% Python-Edu, 10% peS2o. GPT-NeoX tokenizer, uint32 .npy format.

## Dataset Info

- **Rows**: 12
- **Columns**: 3

## Columns

| Column | Type | Description |
|--------|------|-------------|
| filename | Value('string') | Name of the .npy file |
| token_count | Value('int64') | Number of uint32 token IDs in the file |
| size_bytes | Value('int64') | File size in bytes |

## Generation Parameters

```json
{
  "script_name": "data/upload_data.py",
  "model": "N/A (pre-tokenized training data, not model outputs)",
  "description": "Gold (high-quality) tokenized data for pre-pretraining experiments. Composition: 30% Cosmopedia, 25% FineMath-4+, 20% Tulu 3, 15% Python-Edu, 10% peS2o. GPT-NeoX tokenizer, uint32 .npy format.",
  "tokenizer": "allenai/gpt-neox-olmo-dolma-v1_5",
  "format": "Pre-tokenized uint32 .npy memmap arrays for OLMo-core",
  "vocab_size": 50280,
  "eos_token_id": 50279,
  "total_target_tokens": 500000000,
  "composition": {
    "cosmopedia": 0.3,
    "finemath": 0.25,
    "tulu3": 0.2,
    "fineweb_edu": 0.15,
    "cosmopedia_stanford": 0.1
  },
  "seed": 42,
  "combined_train_tokens": 500000000,
  "combined_held_out_tokens": 5000000,
  "combined_train_file": "data/gold/gold_combined.npy",
  "combined_held_out_file": "data/gold/held_out_gold.npy",
  "hyperparameters": {},
  "input_datasets": []
}
```

## Experiment Documentation

For complete experiment details, see [https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining](https://github.com/Zayne-sprague/SC-Research-Notes/tree/main/experiments/prepretraining)

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("reasoning-degeneration-dev/prepretraining-gold-v1", split="train")
print(f"Loaded {len(dataset)} rows")
```

---

*This dataset is tracked in [reasoning-degeneration-dev/PROJECT-MANIFEST](https://huggingface.co/datasets/reasoning-degeneration-dev/PROJECT-MANIFEST)*
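The card lists `token_count` and `size_bytes` for each file but does not say whether the `.npy` files carry a standard NumPy header or are raw uint32 buffers. The following is a minimal sketch, not the authors' loader, of how one might tell the two apart and open a file accordingly. It assumes the file has already been downloaded to a local path; the helper names `looks_headerless`, `open_tokens`, and `count_eos` are hypothetical and do not come from the dataset or from OLMo-core.

```python
import numpy as np

EOS_TOKEN_ID = 50279  # "eos_token_id" from the Generation Parameters block
BYTES_PER_TOKEN = np.dtype(np.uint32).itemsize  # 4 bytes per uint32 token ID


def looks_headerless(token_count: int, size_bytes: int) -> bool:
    """Return True when the file is exactly 4 bytes per token.

    A file written with np.save is larger than that because of its header,
    so an exact match suggests a raw buffer (an assumption, not a guarantee).
    """
    return size_bytes == token_count * BYTES_PER_TOKEN


def open_tokens(path: str, token_count: int, size_bytes: int) -> np.ndarray:
    """Open a token file read-only without loading it fully into memory."""
    if looks_headerless(token_count, size_bytes):
        # Raw uint32 buffer: map it directly with the known length.
        return np.memmap(path, dtype=np.uint32, mode="r", shape=(token_count,))
    # Otherwise treat it as a standard .npy file and let NumPy parse the header.
    return np.load(path, mmap_mode="r")


def count_eos(tokens: np.ndarray) -> int:
    """Count EOS tokens, a rough proxy for the number of documents."""
    return int(np.count_nonzero(tokens == EOS_TOKEN_ID))
```

The `token_count` and `size_bytes` arguments would come from the matching row of the metadata table loaded in the Usage snippet; the `path` argument is whatever local location the corresponding file was saved to.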

Provider:
reasoning-degeneration-dev