Download link: https://modelscope.cn/datasets/epfml/FineWeb-HQ
# FineWeb-HQ
## Dataset Summary
**FineWeb-HQ** is a **high-quality, model-filtered pretraining dataset** derived as a subset of [**FineWeb**](https://huggingface.co/datasets/HuggingFaceFW/fineweb). FineWeb-HQ was created by selecting the **top 10% of FineWeb documents** based on a deep learning classifier trained to identify **structured and knowledge-rich samples**. This classifier uses **XLM-RoBERTa embeddings** to score documents.
To validate our approach, we pretrained **1B-parameter LLMs** with a Llama-like architecture across multiple languages and scripts. The results showed **improvements on standard English benchmarks**, with our dataset achieving a better average rank than its English counterparts [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet) and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu). For its multilingual version, **FineWeb2-HQ**, evaluations on **CMMLU (Chinese), MMLU (German), and MMLU (French)** demonstrated that it **matches FineWeb2's performance while being trained on 6x fewer tokens** and **surpasses it when fully trained**.
| Dataset | Ours | DCLM | FW-Edu | FW |
| :--- | :---: | :---: | :---: | :---: |
| **Average Rank** | 1.8333 | 2.3889 | 2.4444 | 3.3333 |
| ARC (Challenge) | 0.3550 | 0.3530 | **0.3850** | 0.3010 |
| ARC (Easy) | 0.6670 | 0.6470 | **0.6970** | 0.5880 |
| CommonsenseQA | 0.3870 | **0.4100** | 0.3770 | 0.3850 |
| HellaSwag | **0.6040** | 0.5960 | 0.5700 | 0.5930 |
| MMLU | 0.3400 | 0.3160 | **0.3470** | 0.3030 |
| OpenBookQA | 0.3860 | 0.3840 | **0.4180** | 0.3560 |
| PIQA | 0.7510 | 0.7510 | 0.7410 | **0.7620** |
| WinoGrande | **0.5720** | 0.5610 | 0.5660 | 0.5550 |
| TriviaQA | 0.0820 | **0.1240** | 0.0320 | 0.0370 |
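The **Average Rank** row can be reproduced from the nine benchmark rows above. The sketch below assumes that datasets are ranked per benchmark in descending score order and that tied scores share the mean of the positions they occupy (as in the PIQA row). That assumption matches the reported values, but the exact procedure used in the paper may differ. The function names are illustrative only.

```python
from statistics import mean

# Scores copied from the table above; column order: Ours, DCLM, FW-Edu, FW.
DATASETS = ["Ours", "DCLM", "FW-Edu", "FW"]
SCORES = {
    "ARC (Challenge)": [0.3550, 0.3530, 0.3850, 0.3010],
    "ARC (Easy)": [0.6670, 0.6470, 0.6970, 0.5880],
    "CommonsenseQA": [0.3870, 0.4100, 0.3770, 0.3850],
    "HellaSwag": [0.6040, 0.5960, 0.5700, 0.5930],
    "MMLU": [0.3400, 0.3160, 0.3470, 0.3030],
    "OpenBookQA": [0.3860, 0.3840, 0.4180, 0.3560],
    "PIQA": [0.7510, 0.7510, 0.7410, 0.7620],
    "WinoGrande": [0.5720, 0.5610, 0.5660, 0.5550],
    "TriviaQA": [0.0820, 0.1240, 0.0320, 0.0370],
}


def tie_averaged_ranks(values):
    """Rank values in descending order (1 = best); ties share the mean rank."""
    ranks = []
    for v in values:
        higher = sum(1 for other in values if other > v)
        tied = sum(1 for other in values if other == v)  # includes v itself
        # Tied entries occupy positions higher+1 .. higher+tied; use their mean.
        ranks.append(higher + (tied + 1) / 2)
    return ranks


def average_ranks(scores):
    """Average each dataset's per-benchmark rank over all benchmarks."""
    per_benchmark = [tie_averaged_ranks(row) for row in scores.values()]
    return [mean(column) for column in zip(*per_benchmark)]


AVERAGE_RANKS = dict(zip(DATASETS, average_ranks(SCORES)))
for name, rank in AVERAGE_RANKS.items():
    print(f"{name}: {rank:.4f}")
```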
For more details, see our paper [Enhancing Multilingual LLM Pretraining with Model-Based Data Selection](https://arxiv.org/abs/2502.10361).
## Key features
- **High-quality selection**: Top 10% of FineWeb documents by quality
- **Multilingual version**: [FineWeb2-HQ](https://huggingface.co/datasets/epfml/FineWeb2-HQ)
- **Model-based filtering**: Uses an XLM-RoBERTa embedding-based classifier to score documents
- **Enhanced benchmark performance**: Surpasses FineWeb benchmark performance and is competitive with [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet) and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Fully open**: Emphasis on transparency
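
As an illustration of the "top 10% by quality" selection listed above, here is a minimal sketch that keeps the highest-scoring fraction of a collection of scored documents. It is not the pipeline used to build the dataset: the helper `select_top_percent`, the in-memory sort, and the toy records are assumptions made for the example. A real run over FineWeb would need a streaming or threshold-based approach.

```python
def select_top_percent(docs, percent=10, score_key="quality_score"):
    """Return the top `percent` of docs by score, highest first.

    Hypothetical helper for illustration; uses integer arithmetic to
    compute the cut-off so the count is not affected by float rounding.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    keep = -(-len(docs) * percent // 100)  # ceiling division
    ranked = sorted(docs, key=lambda d: d[score_key], reverse=True)
    return ranked[:keep]


# Toy records with made-up scores; real scores come from the classifier.
toy_docs = [{"id": i, "quality_score": i / 20} for i in range(20)]
selected = select_top_percent(toy_docs, percent=10)
print([d["id"] for d in selected])
```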
## Dataset structure
### Data fields
Each data entry includes the original [FineWeb data fields](https://huggingface.co/datasets/HuggingFaceFW/fineweb#data-fields) with the addition of:
- `quality_score`: quality score obtained by the quality classifier
- `embeddings`: array of float arrays containing 768-dimensional XLM-RoBERTa embeddings, one for every 512-token chunk of the tokenized text
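
Because `embeddings` holds one 768-dimensional vector per 512-token chunk, a document with several chunks has several vectors. The sketch below shows one plausible way to reduce them to a single document vector by mean pooling. This is an assumption for illustration, not necessarily how the quality classifier consumed the embeddings, and the record shown is synthetic.

```python
from typing import List

EMBEDDING_DIM = 768  # per the `embeddings` field description above


def mean_pool(chunk_embeddings: List[List[float]]) -> List[float]:
    """Average per-chunk vectors into one document-level vector."""
    if not chunk_embeddings:
        raise ValueError("expected at least one chunk embedding")
    if any(len(chunk) != EMBEDDING_DIM for chunk in chunk_embeddings):
        raise ValueError(f"every chunk must have {EMBEDDING_DIM} dimensions")
    n_chunks = len(chunk_embeddings)
    # zip(*...) walks the vectors dimension by dimension.
    return [sum(dim_values) / n_chunks for dim_values in zip(*chunk_embeddings)]


# Synthetic record with two chunks; the values are placeholders, not real data.
record = {
    "quality_score": 0.5,
    "embeddings": [[0.0] * EMBEDDING_DIM, [1.0] * EMBEDDING_DIM],
}
doc_vector = mean_pool(record["embeddings"])
print(len(record["embeddings"]), len(doc_vector), doc_vector[0])
```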
## Licensing information
Like FineWeb, this dataset is released under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/) and is subject to [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use).
## Dataset origin
Being a subset of FineWeb (v1.3.0), this data covers websites over the 2013-2024 time period.
Because FineWeb is sourced from the internet at large, it is very likely that some personally identifiable information (PII) is present, even though the FineWeb processing has already anonymized email addresses and public IP addresses. If you find your own PII and would like it removed, please fill out the [FineWeb PII removal/opt out form](https://forms.gle/VyNT3ZAUPZjPuWp39).
## Considerations for Using the Data
Before using this dataset for training models, we recommend performing additional filtering for sensitive content such as PII or harmful content.
For social impact, discussion of biases, and known limitations, we refer to the [FineWeb documentation](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
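
As a starting point for the additional filtering recommended above, the following sketch flags documents that contain strings shaped like email addresses or IPv4 addresses. The regular expressions are deliberately simple assumptions: they both overmatch (for example, some version strings look like IP addresses) and miss many other kinds of PII and harmful content. The `text` field name and the helper names are also assumptions for the example.

```python
import re

# Simplified patterns for illustration only; not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def flag_possible_pii(text):
    """Return True if the text contains an email-like or IPv4-like string."""
    return bool(EMAIL_RE.search(text) or IPV4_RE.search(text))


def drop_flagged(docs, text_key="text"):
    """Keep only documents whose text is not flagged."""
    return [d for d in docs if not flag_possible_pii(d[text_key])]


# Made-up sample documents.
samples = [
    {"text": "Contact me at jane@example.com for details."},
    {"text": "The server listens on 192.168.0.1."},
    {"text": "No contact details appear in this sentence."},
]
clean = drop_flagged(samples)
print(len(clean))
```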
## Citation information
This work has been **accepted to the Benchmarks and Datasets Track at the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025)**.
Until the final proceedings are published, please use the following temporary citation (which links to the public preprint):
```
@article{messmer2025multilingdatacomp,
title={Enhancing Multilingual LLM Pretraining with Model-Based Data Selection},
author={Bettina Messmer and Vinko Sabolčec and Martin Jaggi},
journal={arXiv},
year={2025},
url={https://arxiv.org/abs/2502.10361},
}
```
Provider: maas
Created: 2025-09-22



