dclm-edu
Source: ModelScope community (updated 2026-01-06, indexed 2025-10-04)
Download link:
https://modelscope.cn/datasets/HuggingFaceTB/dclm-edu
# DCLM-Edu
## Description
This is a filtered version of the [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) dataset, built with the FineWeb-Edu educational quality [classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). We annotate each web page for educational quality
on a scale from 0 to 5 and keep only samples with a score higher than 2. This dataset is intended for training small language models and was used to train [SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) and [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M).
**_Note:_** As shown in the performance section, further filtering the dataset to keep only **samples with `edu_int_score>=3` yields even better downstream performance when training small language models**. We include score 2 samples to allow for rebalancing and added diversity, but you can filter the dataset with `datasets` or `datatrove` as shown below.
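The FineWeb-Edu classifier's model card derives an integer score by clamping the raw regression output to the range 0 to 5 and rounding it. The sketch below illustrates that step only; it is an assumption that `edu_int_score` in this dataset was computed the same way, and `to_int_score` is a hypothetical helper name.

```python
def to_int_score(raw_score: float) -> int:
    """Clamp a raw classifier output to [0, 5] and round to the nearest integer.

    Assumption: mirrors the post-processing shown on the FineWeb-Edu
    classifier model card; this dataset card does not state its own recipe.
    """
    clamped = max(0.0, min(raw_score, 5.0))
    return int(round(clamped))


# A few illustrative raw scores (made-up values, not taken from the dataset).
for raw in (-0.3, 2.4, 2.6, 7.1):
    print(raw, "->", to_int_score(raw))
```

If that assumption holds, a raw score slightly above 2 would map to an integer score of 2, which would explain how score 2 samples coexist with a "higher than 2" cutoff.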
## How to use
### Using `datasets`
```python
from datasets import load_dataset
fw = load_dataset("HuggingFaceTB/dclm-edu", split="train", streaming=True)
```
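A minimal sketch of the `edu_int_score>=3` filter in plain Python follows. It runs on a few made-up in-memory rows so that it works offline; it assumes `edu_int_score` is a top-level column of each record, and `keep_min_edu_score` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable, Iterator


def keep_min_edu_score(
    rows: Iterable[Dict[str, Any]], min_score: int = 3
) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the rows whose edu_int_score is at least min_score."""
    for row in rows:
        # Assumption: the score is stored under the key "edu_int_score".
        if row["edu_int_score"] >= min_score:
            yield row


# Made-up rows standing in for streamed dataset records.
sample_rows = [
    {"text": "a", "edu_int_score": 2},
    {"text": "b", "edu_int_score": 3},
    {"text": "c", "edu_int_score": 4},
]

kept = list(keep_min_edu_score(sample_rows))
print([row["text"] for row in kept])
```

With the streaming dataset loaded above, the same predicate can be applied lazily through the dataset's own `filter` method, for example `fw.filter(lambda row: row["edu_int_score"] >= 3)`, under the same column-name assumption.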
### Using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/)
```python
from datatrove.pipeline.readers import ParquetReader
# limit determines how many documents will be streamed (remove for all)
data_reader = ParquetReader("hf://datasets/HuggingFaceTB/dclm-edu", glob_pattern="data/*.parquet", limit=1000)
for document in data_reader():
    # do something with document
    print(document)
###############################
# OR for a processing pipeline:
###############################
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import ParquetWriter
pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceTB/dclm-edu", limit=1000),
        LambdaFilter(lambda doc: doc.metadata["edu_int_score"] >= 3),
        ParquetWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```
## Performance
**Results of 360M ablation**
We train a 360M model (using the [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) setup) on 200B tokens from DCLM, FineWeb-Edu, and DCLM-Edu, and evaluate it on several benchmarks. Here, DCLM-Edu denotes DCLM samples with an educational score higher than 3.
We find that the model trained on DCLM-Edu performs better on knowledge and reasoning tasks (MMLU & ARC):
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/hOFJRusg6fEEtCpN-RJaP.png" width="700" alt="image">
We invite users to experiment with different data mixing depending on their model size.
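One way to experiment with data mixing is to sample from several sources with fixed weights. The sketch below uses only the standard library and made-up placeholder documents; the source names, the 0.7/0.3 weights, and the `mix_streams` helper are illustrative assumptions, not a recommended recipe.

```python
import random
from typing import Any, Dict, Iterable, Iterator, Tuple


def mix_streams(
    streams: Dict[str, Iterable[Any]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, Any]]:
    """Yield (source_name, item) pairs, picking the next source at random
    in proportion to its weight until every source is exhausted."""
    rng = random.Random(seed)
    iterators = {name: iter(stream) for name, stream in streams.items()}
    while iterators:
        names = list(iterators)
        chosen = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield chosen, next(iterators[chosen])
        except StopIteration:
            # This source ran out; keep sampling from the remaining ones.
            del iterators[chosen]


# Placeholder documents standing in for two streamed datasets.
dclm_edu_docs = [f"edu-{i}" for i in range(6)]
fineweb_edu_docs = [f"fw-{i}" for i in range(4)]

mixed = list(
    mix_streams(
        {"dclm-edu": dclm_edu_docs, "fineweb-edu": fineweb_edu_docs},
        {"dclm-edu": 0.7, "fineweb-edu": 0.3},  # hypothetical mixing weights
    )
)
print(len(mixed))
```

For real streaming datasets, the `datasets` library provides `interleave_datasets` with a `probabilities` argument, which serves the same purpose without a hand-written sampler.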
**Results of 1.7B ablation**
We also ran ablations at the 1.7B scale: starting from an intermediate checkpoint of SmolLM2 1.7B (3T tokens), we perform a decay on different subsets of DCLM obtained by edu filtering with thresholds 2, 3, and 4.
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/ImwiEe712SN5TalxFOeeJ.png" width="700" alt="image">
However, we found that the gains from introducing this dataset mid-training in SmolLM2 1.7B (which was trained on a mix of DCLM and FineWeb-Edu for 6T+ tokens) weren't consistent with the ablation findings, so we used the dataset only for SmolLM2 135M and 360M.
## License
Following DCLM-Baseline, this dataset is licensed under CC-BY-4.0.
## Citation
```bibtex
@misc{allal2025smollm2smolgoesbig,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
year={2025},
eprint={2502.02737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02737},
}
```
Provider: maas
Created: 2025-09-08



