ettin-extension-data
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/jhu-clsp/ettin-extension-data
Description:
# Ettin Mid-training Data
[License: MIT](https://opensource.org/licenses/MIT)
[Paper: arXiv 2507.11412](https://arxiv.org/abs/2507.11412)
[Models: jhu-clsp on Hugging Face](https://huggingface.co/jhu-clsp)
[Code: GitHub](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)
> **Phase 2 of 3**: Higher-quality filtered data with context extension (250B tokens) used for mid-training of Ettin models.
This dataset contains the mid-training phase data used to train all [Ettin encoder and decoder models](https://huggingface.co/collections/jhu-clsp/encoders-vs-decoders-the-ettin-suite-686303e16142257eed8e6aeb). This phase focuses on **higher-quality filtered data** and **context length extension to 8K tokens**. The data is provided in **MDS format** ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).
## 📊 Data Composition
| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| DCLM (Dolmino) | 175.5 | 70.4% | High-quality filtered web crawl |
| Starcoder | 38.4 | 15.4% | Code repositories and files |
| Math (Dolmino) | 10.4 | 4.2% | Mathematical content (filtered) |
| PeS2o | 8.3 | 3.3% | Scientific papers |
| Reddit | 6.2 | 2.5% | Social discussion threads |
| Arxiv | 4.1 | 1.6% | Academic preprints |
| StackExchange (Dolmino) | 2.7 | 1.1% | Q&A forums (filtered) |
| Tulu Flan | 2.4 | 1.0% | Instruction-following data |
| Books | 0.8 | 0.3% | Literature and reference books |
| Wikipedia | 0.5 | 0.2% | Encyclopedia articles |
| **Total** | **249.3** | **100.0%** | Quality-focused mixture |
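The percentage column follows directly from the token counts. The short sketch below recomputes each share from the figures in the table; the names `TOKENS_B` and `mixture_shares` are introduced here purely for illustration and are not part of the dataset or any library.

```python
# Token counts in billions, copied from the Data Composition table.
TOKENS_B = {
    "DCLM (Dolmino)": 175.5,
    "Starcoder": 38.4,
    "Math (Dolmino)": 10.4,
    "PeS2o": 8.3,
    "Reddit": 6.2,
    "Arxiv": 4.1,
    "StackExchange (Dolmino)": 2.7,
    "Tulu Flan": 2.4,
    "Books": 0.8,
    "Wikipedia": 0.5,
}


def mixture_shares(tokens):
    """Return each source's share of the total in percent, rounded to 0.1."""
    total = sum(tokens.values())
    return {name: round(100 * count / total, 1) for name, count in tokens.items()}


total_b = round(sum(TOKENS_B.values()), 1)
shares = mixture_shares(TOKENS_B)

for name, pct in shares.items():
    print(f"{name:<26}{TOKENS_B[name]:>7.1f}B{pct:>7.1f}%")
print(f"{'Total':<26}{total_b:>7.1f}B")
```

The same dictionary can serve as a starting point for sampling weights if you want to reproduce the mixture proportions with a different token budget.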
## 🔧 Key Changes from Pre-training
### Data Quality Improvements
- **Filtered DCLM**: Using Dolmino-filtered version instead of raw DCLM
- **Enhanced Math**: Dolmino-filtered mathematical content
- **Curated StackExchange**: Higher-quality Q&A content
- **Removed Noisy Sources**: Dropped CC Head, CC News, and general StackExchange
### Technical Improvements
- **Context Extension**: Increased from 1K to 8K token sequences
- **RoPE Updates**: Modified positional encoding for longer context
- **Learning Schedule**: Inverse square root decay from peak LR
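To make the learning-schedule item concrete, here is a minimal sketch of one common form of inverse square root decay. The exact formula, peak learning rate, and step counts used for Ettin are not stated in this card, so the function shape, the `scale_steps` parameter, and the numbers below are assumptions chosen for illustration only.

```python
import math


def inverse_sqrt_lr(step, peak_lr, scale_steps):
    """Learning rate that equals peak_lr at step 0 and decays like 1/sqrt(step).

    `scale_steps` is a hypothetical knob: after `scale_steps` steps the
    learning rate has fallen to peak_lr / sqrt(2).
    """
    if step < 0 or scale_steps <= 0:
        raise ValueError("step must be >= 0 and scale_steps must be > 0")
    return peak_lr / math.sqrt(1.0 + step / scale_steps)


# Placeholder values, not the Ettin hyperparameters.
PEAK_LR = 1e-3
SCALE_STEPS = 1000

for step in (0, 1000, 3000, 8000):
    print(step, inverse_sqrt_lr(step, PEAK_LR, SCALE_STEPS))
```

Unlike a cosine or linear schedule, this form does not need the total number of training steps up front, which is one reason it is often paired with a separate final decay phase.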
## 🚀 Usage
For pre-training see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT
### Direct Access
```python
# Requires the MosaicML Streaming library: pip install mosaicml-streaming
from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/ettin-extension-data',
    local='/tmp/ettin-extension-data',
    shuffle=True
)

# Access samples (note: these will be longer sequences)
for sample in dataset:
    text = sample['text']  # Up to 8K tokens
    # Process your data...
```
## 📁 Structure
Each folder contains filtered, higher-quality data sources in MDS format:
- `arxiv/` - Academic papers from ArXiv
- `books/` - Literature and reference books
- `dclm_dolmino/` - Dolmino-filtered web crawl data (primary source)
- `math_dolmino/` - Filtered mathematical content
- `pes2o/` - Scientific papers
- `reddit/` - Reddit discussion threads
- `stackexchange_dolmino/` - Filtered StackExchange Q&A
- `starcoder/` - Code from GitHub repositories
- `tulu_flan/` - Instruction-following examples
- `wikipedia/` - Wikipedia articles
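Because each source lives in its own folder, you may want to load only a subset of them. The sketch below builds per-folder `remote`/`local` path pairs from the folder names listed above. It assumes each folder is a self-contained MDS directory reachable under the same root URL used in the Direct Access example; the helper `stream_kwargs` and the constants are hypothetical names, not part of any library.

```python
# Root locations taken from the Direct Access example.
REMOTE_ROOT = "https://huggingface.co/datasets/jhu-clsp/ettin-extension-data"
LOCAL_ROOT = "/tmp/ettin-extension-data"

# Folder names as listed in the Structure section.
SOURCES = [
    "arxiv",
    "books",
    "dclm_dolmino",
    "math_dolmino",
    "pes2o",
    "reddit",
    "stackexchange_dolmino",
    "starcoder",
    "tulu_flan",
    "wikipedia",
]


def stream_kwargs(sources, remote_root=REMOTE_ROOT, local_root=LOCAL_ROOT):
    """Build one {'remote': ..., 'local': ...} dict per requested source folder."""
    unknown = sorted(set(sources) - set(SOURCES))
    if unknown:
        raise ValueError(f"Unknown source folders: {unknown}")
    return [
        {"remote": f"{remote_root}/{name}", "local": f"{local_root}/{name}"}
        for name in sources
    ]


# Example: select only the two scientific-paper sources.
science_only = stream_kwargs(["arxiv", "pes2o"])
for kwargs in science_only:
    print(kwargs)
```

With `mosaicml-streaming` installed, each dictionary should be usable as `Stream(**kwargs)`, and the resulting list can be passed to `StreamingDataset(streams=...)`. Whether the folders resolve at these exact URLs is an assumption you should verify against the repository layout.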
## 🔗 Related Resources
- **Models**: [Ettin Model Suite](https://huggingface.co/collections/jhu-clsp/encoders-vs-decoders-the-ettin-suite-686303e16142257eed8e6aeb) (17M-1B parameters)
- **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) (1.7T tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) (50B tokens)
- **Training Order**: [Batch-level Data Order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order)
- **Paper**: [Arxiv link](https://arxiv.org/abs/2507.11412)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)
## Citation
```bibtex
@misc{weller2025seqvsseqopen,
title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
year={2025},
eprint={2507.11412},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.11412},
}
```
Provider: maas
Created: 2025-09-11



