
ettin-extension-data

ModelScope community. Updated 2025-12-05; indexed 2025-09-13.
Download link:
https://modelscope.cn/datasets/jhu-clsp/ettin-extension-data
Description:
# Ettin Mid-training Data

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2507.11412) [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-12%20Models-blue)](https://huggingface.co/jhu-clsp) [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

> **Phase 2 of 3**: Higher-quality filtered data with context extension (250B tokens) used for mid-training of Ettin models.

This dataset contains the mid-training phase data used to train all [Ettin encoder and decoder models](https://huggingface.co/collections/jhu-clsp/encoders-vs-decoders-the-ettin-suite-686303e16142257eed8e6aeb). This phase focuses on **higher-quality filtered data** and **context length extension to 8K tokens**. The data is provided in **MDS format** ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).

## 📊 Data Composition

| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| DCLM (Dolmino) | 175.5 | 70.4% | High-quality filtered web crawl |
| Starcoder | 38.4 | 15.4% | Code repositories and files |
| Math (Dolmino) | 10.4 | 4.2% | Mathematical content (filtered) |
| PeS2o | 8.3 | 3.3% | Scientific papers |
| Reddit | 6.2 | 2.5% | Social discussion threads |
| Arxiv | 4.1 | 1.6% | Academic preprints |
| StackExchange (Dolmino) | 2.7 | 1.1% | Q&A forums (filtered) |
| Tulu Flan | 2.4 | 1.0% | Instruction-following data |
| Books | 0.8 | 0.3% | Literature and reference books |
| Wikipedia | 0.5 | 0.2% | Encyclopedia articles |
| **Total** | **249.3** | **100.0%** | Quality-focused mixture |

## 🔧 Key Changes from Pre-training

### Data Quality Improvements

- **Filtered DCLM**: Using Dolmino-filtered version instead of raw DCLM
- **Enhanced Math**: Dolmino-filtered mathematical content
- **Curated StackExchange**: Higher-quality Q&A content
- **Removed Noisy Sources**: Dropped CC Head, CC News, and general StackExchange

### Technical Improvements

- **Context Extension**: Increased from 1K to 8K token sequences
- **RoPE Updates**: Modified positional encoding for longer context
- **Learning Schedule**: Inverse square root decay from peak LR

## 🚀 Usage

For pre-training see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

### Direct Access

```python
from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/ettin-extension-data',
    local='/tmp/ettin-extension-data',
    shuffle=True
)

# Access samples (note: these will be longer sequences)
for sample in dataset:
    text = sample['text']  # Up to 8K tokens
    # Process your data...
```

## 📁 Structure

Each folder contains filtered, higher-quality data sources in MDS format:

- `arxiv/` - Academic papers from ArXiv
- `books/` - Literature and reference books
- `dclm_dolmino/` - Dolmino-filtered web crawl data (primary source)
- `math_dolmino/` - Filtered mathematical content
- `pes2o/` - Scientific papers
- `reddit/` - Reddit discussion threads
- `stackexchange_dolmino/` - Filtered StackExchange Q&A
- `starcoder/` - Code from GitHub repositories
- `tulu_flan/` - Instruction-following examples
- `wikipedia/` - Wikipedia articles

## 🔗 Related Resources

- **Models**: [Ettin Model Suite](https://huggingface.co/collections/jhu-clsp/encoders-vs-decoders-the-ettin-suite-686303e16142257eed8e6aeb) (17M-1B parameters)
- **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) (1.7T tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) (50B tokens)
- **Training Order**: [Batch-level Data Order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order)
- **Paper**: [Arxiv link](https://arxiv.org/abs/2507.11412)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

## Citation

```bibtex
@misc{weller2025seqvsseqopen,
  title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
  author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2507.11412},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11412},
}
```
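
## Worked example: mixture weights

The token counts in the Data Composition table can be turned into per-source sampling weights, which also serves as a sanity check that the percentages match the counts. The sketch below is illustrative only and is not taken from the Ettin training code: the pairing of table rows with folder names from the Structure list is an assumption, and `mixture_weights` and `sample_sources` are hypothetical helper names.

```python
import random

# Token counts in billions, copied from the Data Composition table.
# Keys reuse the folder names from the Structure list; pairing each table
# row with a folder is an assumption made for this example.
TOKENS_B = {
    "dclm_dolmino": 175.5,
    "starcoder": 38.4,
    "math_dolmino": 10.4,
    "pes2o": 8.3,
    "reddit": 6.2,
    "arxiv": 4.1,
    "stackexchange_dolmino": 2.7,
    "tulu_flan": 2.4,
    "books": 0.8,
    "wikipedia": 0.5,
}


def mixture_weights(tokens):
    """Return each source's share of the total token count (hypothetical helper)."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


def sample_sources(weights, k, seed=0):
    """Draw k source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


weights = mixture_weights(TOKENS_B)
for name, share in weights.items():
    print(f"{name:<24}{share:6.1%}")
```

Rounded to one decimal place, the computed shares agree with the Percentage column, and the counts sum to the 249.3B total.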

Provider:
maas
Created:
2025-09-11