ettin-decay-data

Name: ettin-decay-data
Creator: maas
Published: 2025-12-05 11:43:42
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-09-13.

Download link:

https://modelscope.cn/datasets/jhu-clsp/ettin-decay-data


Resource description:

# Ettin Decay Phase Data

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2507.11412) [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-12%20Models-blue)](https://huggingface.co/jhu-clsp) [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

> **Phase 3 of 3**: Premium data sources for final training phase (100B tokens) following the ProLong recipe.

This dataset contains the decay phase data used to train all [Ettin encoder and decoder models](https://huggingface.co/jhu-clsp). This final phase uses **premium data sources** with emphasis on **long-form content** and **educational materials**. The data is provided in **MDS format** ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).

## Abstract

The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

## 📊 Data Composition

| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| DCLM (Dolmino) | 26.0 | 31.9% | Highest-quality web crawl data |
| Code Repos | 20.2 | 24.7% | Premium code repositories |
| Books | 10.5 | 12.9% | Literature and reference books |
| Math (Dolmino) | 5.0 | 6.1% | Mathematical content (premium) |
| StackExchange (Dolmino) | 4.0 | 4.9% | High-quality Q&A content |
| Tulu Flan | 4.1 | 5.0% | Instruction-following data |
| Arxiv | 3.0 | 3.7% | Academic preprints |
| Wikipedia | 3.0 | 3.7% | Encyclopedia articles |
| Textbooks | 0.5 | 0.6% | Educational textbooks |
| **Total** | **81.6** | **100.0%** | Premium quality mixture |

## 🎯 Key Features of Decay Phase

### Training Characteristics

- **Aggressive LR Decay**: Learning rate decays to 0.02 of peak
- **Long Context**: Maintains 8K token sequences from mid-training
- **Lower Masking**: 5% masking ratio (vs 30% earlier) for encoders
- **Quality Over Quantity**: Focus on premium sources rather than scale

## 🚀 Usage

Please see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

### Direct Access

```python
from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/ettin-decay-data',
    local='/tmp/ettin-decay-data',
    shuffle=True
)

# Access premium quality samples
for sample in dataset:
    text = sample['text']  # High-quality, long-form content
    # Process your data...
```

## 📁 Structure

Each folder contains premium quality data sources in MDS format:

- `arxiv/` - Academic papers from ArXiv
- `books/` - Literature and reference books (expanded)
- `books_2/` - Additional book collections
- `code_repos/` - Premium code repositories
- `dclm_dolmino/` - Highest-quality filtered web data
- `math_dolmino/` - Premium mathematical content
- `stackexchange_dolmino/` - Top-quality Q&A content
- `stackexchange_dolmino_dup/` - Additional curated Q&A
- `stackexchange_dolmino_dup_2/` - Extra Q&A collections
- `textbooks/` - Educational textbook content
- `textbooks_2/` - Additional textbook collections
- `tulu_flan/` - Instruction-following examples
- `wikipedia/` - Wikipedia articles

## 💡 Usage in Cross-Objective Training

This decay phase data is also used for **cross-objective training** experiments:

- **Decoder → Encoder**: Training decoders with MLM on this premium data
- **Encoder → Decoder**: Training encoders with CLM on this premium data
- **Extended Training**: 50B additional tokens for cross-objective experiments

## 🔗 Related Resources

- **Models**: [Ettin Model Suite](https://huggingface.co/jhu-clsp) (17M-1B parameters)
- **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) (1.7T tokens)
- **Phase 2**: [Mid-training Data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) (250B tokens)
- **Training Order**: [Batch-level Data Order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order)
- **Paper**: [Arxiv link](https://arxiv.org/abs/2507.11412)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

## Citation

```bibtex
@misc{weller2025seqvsseqopen,
      title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
      author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2507.11412},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11412},
}
```
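
As a quick sanity check on the Data Composition table, the sketch below sums the per-source token counts and recomputes each share against the stated 81.6B total. The numbers are copied from the table; the variable and function names are illustrative only. Run as written, the listed rows add up to 76.3B tokens, which leaves about 5.3B tokens of the stated total without a listed row. The README does not say where that difference comes from.

```python
# Token counts in billions, copied from the Data Composition table.
TOKENS_B = {
    "DCLM (Dolmino)": 26.0,
    "Code Repos": 20.2,
    "Books": 10.5,
    "Math (Dolmino)": 5.0,
    "StackExchange (Dolmino)": 4.0,
    "Tulu Flan": 4.1,
    "Arxiv": 3.0,
    "Wikipedia": 3.0,
    "Textbooks": 0.5,
}
STATED_TOTAL_B = 81.6  # the "Total" row of the table


def shares(tokens, total):
    """Return each source's share of `total` as a percentage, rounded to 0.1."""
    return {name: round(100 * count / total, 1) for name, count in tokens.items()}


listed_total_b = round(sum(TOKENS_B.values()), 1)
unlisted_b = round(STATED_TOTAL_B - listed_total_b, 1)
percentages = shares(TOKENS_B, STATED_TOTAL_B)

print(f"sum of listed rows: {listed_total_b}B")
print(f"stated total:       {STATED_TOTAL_B}B")
print(f"not itemized:       {unlisted_b}B")
for name, pct in percentages.items():
    print(f"{name:<26}{pct:>5}%")
```

The recomputed shares agree with the table to within 0.1 percentage points, so the published percentages appear to be taken relative to the 81.6B total rather than to the sum of the listed rows.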

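The training characteristics listed for the decay phase mention 8K-token sequences and a 5% masking ratio for encoders, compared with 30% earlier. The sketch below illustrates how many positions per sequence each ratio selects. It assumes that 8K means 8192 tokens, and `choose_mask_positions` is a hypothetical helper that samples positions uniformly at random; it is a simplification for illustration, not code from the Ettin or ModernBERT repositories.

```python
import random

SEQ_LEN = 8192  # assumption: "8K" means 8192 tokens


def choose_mask_positions(seq_len, mask_ratio, seed=0):
    """Pick round(seq_len * mask_ratio) distinct positions uniformly at random.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n_masked = round(seq_len * mask_ratio)
    return sorted(rng.sample(range(seq_len), n_masked))


decay_positions = choose_mask_positions(SEQ_LEN, 0.05)    # decay phase ratio
earlier_positions = choose_mask_positions(SEQ_LEN, 0.30)  # earlier phases

print(len(decay_positions), "masked positions at 5%")
print(len(earlier_positions), "masked positions at 30%")
```

Under these assumptions, a full-length sequence has 410 masked positions at 5% and 2458 at 30%, so the decay phase gives the encoder far fewer prediction targets per sequence but much more unmasked context around each one.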

Provider: maas

Created: 2025-09-10
