abanm/Pretrain_1
Hugging Face | Updated 2025-11-13 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/abanm/Pretrain_1
Resource description:
---
pretty_name: Pretrain_1
tags:
- pretraining
- raw-corpus
task_categories:
- text-generation
language:
- en
size_categories:
- 1B<n<10B
---
# Pretrain_1
## Dataset Summary
This corpus aggregates short- to medium-length English text from multiple public sources chosen for cleanliness, diversity, and token efficiency. Emphasis is placed on:

- Short sequences (e.g., 8–384 tokens) for models with modest context windows
- Surface robustness (grammar/tense, split/rephrase)
- Stepwise reasoning (elementary → competition math)
- Lexical coverage (dictionary triples, wordlists, numbers)
- Exact GPT-2 token counts, published per file and per bucket
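
As a rough illustration of what per-bucket token counting over short sequences might look like, here is a minimal sketch. It is not the pipeline used to build this corpus. The bucket edges in `BUCKET_EDGES` and the function names are hypothetical, since the card does not say which buckets it uses; only the 8–384 range comes from the summary above. The default counter splits on whitespace so the example runs without extra dependencies; for real GPT-2 counts you would pass a GPT-2 tokenizer instead, for example `lambda t: len(tiktoken.get_encoding("gpt2").encode(t))`.

```python
from bisect import bisect_right
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical bucket edges (assumption): the card only gives 8-384 as an example range.
BUCKET_EDGES: List[int] = [8, 32, 128, 384]


def whitespace_token_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, NOT real GPT-2 tokens."""
    return len(text.split())


def bucket_label(n_tokens: int, edges: List[int] = BUCKET_EDGES) -> Optional[str]:
    """Return a label such as '8-31', or None if n_tokens is outside the edges."""
    if n_tokens < edges[0] or n_tokens > edges[-1]:
        return None
    # Index of the bucket whose lower edge is <= n_tokens.
    i = min(bisect_right(edges, n_tokens) - 1, len(edges) - 2)
    lo = edges[i]
    # The last bucket includes its upper edge; earlier buckets stop one short.
    hi = edges[i + 1] if i == len(edges) - 2 else edges[i + 1] - 1
    return f"{lo}-{hi}"


def token_counts_per_bucket(
    texts: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    edges: List[int] = BUCKET_EDGES,
) -> Dict[str, int]:
    """Sum token counts per length bucket, skipping out-of-range texts."""
    totals: Counter = Counter()
    for text in texts:
        n = count_tokens(text)
        label = bucket_label(n, edges)
        if label is not None:
            totals[label] += n
    return dict(totals)
```

Calling `token_counts_per_bucket(texts)` on an iterable of strings returns a dictionary mapping each bucket label to the total number of tokens that fell into it. Texts shorter than the first edge or longer than the last are skipped.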
Provider:
abanm



