jhu-clsp/ettin-pretraining-data
Source: Hugging Face | Updated: 2025-07-18 | Indexed: 2025-08-09
Download link:
https://hf-mirror.com/datasets/jhu-clsp/ettin-pretraining-data
Description:
The Ettin Pre-training Data contains the pre-training phase data used to train all Ettin encoder and decoder models. The dataset is a diverse mixture of sources totaling 1.7T tokens: high-quality web crawl data, Common Crawl head documents, code repositories and files, social media discussion threads, scientific papers, academic preprints, Q&A forums, instruction-following data, mathematical content, and more. The data is provided in MDS format, suitable for use with the Composer and ModernBERT training libraries.
Provider:
jhu-clsp



