
FineWeb2-HQ

ModelScope community, updated 2025-12-05, indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/epfml/FineWeb2-HQ
# FineWeb2-HQ

## Dataset summary

FineWeb2-HQ is a **high-quality, model-filtered pretraining dataset** derived as a subset of [**FineWeb2**](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2), spanning **20 languages**. It enables around 6x faster pretraining compared to the base dataset. FineWeb2-HQ was created by selecting the **top 10% quality documents of FineWeb2** in each language, based on scores assigned by a deep learning classifier trained to identify **structured and knowledge-rich samples** using [**XLM-RoBERTa**](https://huggingface.co/FacebookAI/xlm-roberta-base) **embeddings**.

<center> <img src="https://huggingface.co/datasets/epfml/FineWeb2-HQ/raw/main/agg_score_plot.svg" style="width: 70%;" /> </center>

Validation was performed by pretraining **1B-parameter LLMs** (llama-like architecture) across multiple languages and writing systems (scripts). Evaluations on **CMMLU (Chinese) and MMLU (German & French)** demonstrate that **FineWeb2-HQ matches FineWeb2 performance when trained with 6x fewer tokens, and outperforms it when fully trained**. **Improvements were also observed across other benchmarks**, where FineWeb2-HQ outperforms its English cousins [DCLM](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet) and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

For more details, see our paper [Enhancing Multilingual LLM Pretraining with Model-Based Data Selection](https://arxiv.org/abs/2502.10361).
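
The per-language "top 10% by classifier score" selection described above can be illustrated with a small sketch. This is not the authors' pipeline: the function name `select_top_fraction`, the toy records, and the tie and rounding behaviour are assumptions made for illustration only. The scores themselves would come from the quality classifier, which is not reproduced here.

```python
import math
from collections import defaultdict


def select_top_fraction(docs, fraction=0.10):
    """Keep the highest-scoring `fraction` of documents within each language.

    Hypothetical helper: rounding up and keeping at least one document
    per language are assumptions, not details taken from the paper.
    """
    by_language = defaultdict(list)
    for doc in docs:
        by_language[doc["language"]].append(doc)

    selected = []
    for group in by_language.values():
        # Rank this language's documents from best to worst score.
        ranked = sorted(group, key=lambda d: d["quality_score"], reverse=True)
        keep = max(1, math.ceil(len(ranked) * fraction))
        selected.extend(ranked[:keep])
    return selected


# Toy corpus with made-up scores: 20 German and 10 French documents.
toy_docs = (
    [{"id": f"de-{i}", "language": "deu", "quality_score": i / 20} for i in range(20)]
    + [{"id": f"fr-{i}", "language": "fra", "quality_score": i / 10} for i in range(10)]
)

top = select_top_fraction(toy_docs)
top_ids = sorted(d["id"] for d in top)
print(top_ids)  # ['de-18', 'de-19', 'fr-9']
```

Ranking within each language, rather than over the pooled corpus, keeps one language's score distribution from crowding out another's.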
## Key features

- **High-quality selection**: Top 10% of FineWeb2 documents by quality
- **Multilingual coverage**: 20 languages, ensuring diverse linguistic representation
- **Model-based filtering**: Uses an XLM-RoBERTa embedding-based classifier to score documents
- **Enhanced benchmark performance**: Surpasses FineWeb2 benchmark performance
- **Fully open**: Emphasis on transparency

## Languages and subsets

| Subset name | Language name | Number of documents | Disk size |
|-------------|---------------|--------------------:|----------:|
| rus_Cyrl | Russian | 55,220,956 | 1.2T |
| cmn_Hani | Chinese | 54,211,986 | 784G |
| deu_Latn | German | 43,095,728 | 618G |
| spa_Latn | Spanish | 40,057,637 | 515G |
| jpn_Jpan | Japanese | 34,185,427 | 393G |
| fra_Latn | French | 32,248,772 | 483G |
| ita_Latn | Italian | 21,180,304 | 269G |
| por_Latn | Portuguese | 18,135,468 | 222G |
| pol_Latn | Polish | 13,384,885 | 168G |
| nld_Latn | Dutch | 12,920,963 | 160G |
| ind_Latn | Indonesian | 8,911,149 | 125G |
| tur_Latn | Turkish | 8,578,808 | 100G |
| ces_Latn | Czech | 5,995,459 | 104G |
| arb_Arab | Arabic | 5,560,599 | 94G |
| fas_Arab | Persian | 5,107,187 | 69G |
| hun_Latn | Hungarian | 4,527,332 | 79G |
| swe_Latn | Swedish | 4,382,454 | 61G |
| ell_Grek | Greek | 4,346,440 | 84G |
| dan_Latn | Danish | 4,082,751 | 61G |
| vie_Latn | Vietnamese | 4,003,956 | 59G |

The approach described in the paper is easy to extend to other languages, and we might consider adding new languages to an upcoming version of this dataset.

We also separately release the computed general-purpose embedding vectors for the full sets of the original FineWeb2 dataset (not just the HQ subsets) in the respective languages, as they can be useful for applications beyond quality filtering: [FineWeb2-embedded](https://huggingface.co/datasets/epfml/FineWeb2-embedded).
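
As one illustration of using the embeddings beyond quality filtering, the sketch below compares two documents by similarity. Each record's `embeddings` field (see Data fields) holds one 768-dimensional vector per 512-token chunk. Averaging the chunk vectors into a single document vector is an assumption made here for illustration, as are the helper names `mean_pool` and `cosine_similarity`; toy 3-dimensional vectors stand in for the real ones.

```python
import math


def mean_pool(chunk_embeddings):
    """Average per-chunk vectors into one document vector (assumed pooling)."""
    n_chunks = len(chunk_embeddings)
    dim = len(chunk_embeddings[0])
    return [sum(chunk[i] for chunk in chunk_embeddings) / n_chunks for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins: document A has two chunks, document B has one.
doc_a = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
doc_b = [[2.0, 2.0, 0.0]]

vec_a = mean_pool(doc_a)  # [1.0, 1.0, 0.0]
vec_b = mean_pool(doc_b)  # [2.0, 2.0, 0.0]
similarity = cosine_similarity(vec_a, vec_b)
print(round(similarity, 6))  # 1.0, the two vectors point in the same direction
```

With real data, the same two helpers could be applied to the `embeddings` field of any pair of records, for example for near-duplicate detection or topic clustering.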
## Dataset structure

### Data fields

Each data entry includes the original [FineWeb2 data fields](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2#data-fields) with the addition of:

- `quality_score`: quality score assigned by the quality classifier
- `embeddings`: array of float arrays holding one 768-dimensional XLM-RoBERTa embedding for each 512-token chunk of the tokenized text

### Data instance

```json
{
  "id": "<urn:uuid:f26003c7-6084-4791-b3fe-240eedc37e76>",
  "text": "Plutonium ist einer der gefährlichsten Stoffe der Welt. Es entsteht als hochgiftiges und radioaktives Nebenprodukt der Energiegewinnung in Atomkraftwerken. Wer nur ein Millionstel Gramm – ein kaum staubkorngroßes Teilchen – der Substanz einatmet, kann daran sterben. In der Natur kommt der Stoff nur in geringsten Mengen vor, wird aber künstlich hergestellt, weil man damit Bomben bauen kann. Je nach Reinheitsgrad reichen für eine Atombombe bereits fünf Kilogramm. Bis zum Beginn der achtziger Jahre des letzten Jahrhunderts hatten die Reaktoren weltweit bereits rund 300.000 Kilogramm erbrütet. Jährlich kommen etwa 20.000 Kilo hinzu. Genau dieser Stoff wird zu Land und zu Wasser um den ganzen Erdball herum transportiert. Legendär sind die Castor-Transporte, bei denen unter strengsten Sicherheitsvorkehrungen und entsprechenden Kosten abgebrannte Brennelemente aus deutschen Kernkraftwerken zur Wiederaufbereitung nach La Hague (Frankreich) oder Sellafield (Großbritannien) gebracht werden. Erst vergangenen Mai hat ein Frachter die größte Menge wiederaufbereiteten Mülls aller Zeiten von Frankreich nach Japan gebracht. Nicht auszudenken, was ein Unfall auf See bedeuten würde.",
  "date": "2014-03-16T08:53:38Z",
  "dump": "CC-MAIN-2014-10",
  "embeddings": [[ ... ]],
  "file_path": "s3://commoncrawl/crawl-data/CC-MAIN-2014-10/segments/1394678702159/warc/CC-MAIN-20140313024502-00039-ip-10-183-142-35.ec2.internal.warc.gz",
  "language": "deu",
  "language_score": 0.9983288645744324,
  "language_script": "Latn",
  "minhash_cluster_size": 2,
  "top_langs": {"deu_Latn_score": 0.9983288645744324},
  "url": "http://www.greenpeace.org/austria/de/themen/atom/probleme/atomtransporte/",
  "quality_score": 0.06472613662481308
}
```

## Usage

You can load the dataset in Python using `datasets`:

```python
from datasets import load_dataset

# Load the German subset; any subset name from the table above works here.
dataset = load_dataset("epfml/FineWeb2-HQ", "deu_Latn")
```

## Licensing information

Like FineWeb2, this dataset is released under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/) and is subject to [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use).

## Dataset origin

Being a subset of FineWeb2, this data covers websites over the 2013-2024 time period.

Because FineWeb2 is sourced from the internet at large, it is very likely that some personally identifiable information (PII) will be present, even though the FineWeb2 processing has already anonymized email addresses and public IP addresses. If you find your own PII and would like it removed, please fill out the [FineWeb2 PII removal/opt out form](https://forms.gle/VyNT3ZAUPZjPuWp39).

CommonCrawl respects robots.txt at crawl time, but if you are a webmaster and find your website in FineWeb2 and would like to have it removed, you may also use the [FineWeb2 PII removal/opt out form](https://forms.gle/VyNT3ZAUPZjPuWp39).

## Considerations for Using the Data

Before using this dataset for training models, we recommend performing additional filtering for sensitive content such as PII or harmful content.

For social impact, discussion of biases, and known limitations, we also refer to the [FineWeb2 documentation](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).
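
The recommended additional filtering could start from something like the following minimal sketch. The pattern list, the function names `is_clean` and `filter_records`, and the toy records are all hypothetical. A single phone-number-like regex is nowhere near a complete PII or harmful-content filter; it only shows where such a step would sit.

```python
import re

# Hypothetical pattern list: one crude, phone-number-like pattern that
# requires a leading "+". A real pipeline would need far broader,
# language-specific rules.
SENSITIVE_PATTERNS = [
    re.compile(r"\+\d[\d\s-]{7,}\d"),
]


def is_clean(record, patterns=SENSITIVE_PATTERNS):
    """Return True if no sensitive pattern matches the record's text."""
    return not any(pattern.search(record["text"]) for pattern in patterns)


def filter_records(records, patterns=SENSITIVE_PATTERNS):
    """Lazily yield only the records that pass `is_clean`."""
    for record in records:
        if is_clean(record, patterns):
            yield record


# Toy records using the same `text` field as the dataset entries.
toy_records = [
    {"id": "a", "text": "Plutonium ist einer der gefährlichsten Stoffe der Welt."},
    {"id": "b", "text": "Rufen Sie mich an: +43 660 1234567"},
]

kept_ids = [record["id"] for record in filter_records(toy_records)]
print(kept_ids)  # ['a']
```

Because `filter_records` is a generator, it can also wrap a dataset opened with `load_dataset(..., streaming=True)` so that records are filtered as they arrive rather than after a full download. The split name to pass in that call is not stated in this card and would need to be checked.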
## Citation information

If you use this dataset in your research or applications, please use the following citation:

```
@article{messmer2025multilingdatacomp,
  title={Enhancing Multilingual LLM Pretraining with Model-Based Data Selection},
  author={Bettina Messmer and Vinko Sabolčec and Martin Jaggi},
  journal={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2502.10361},
}
```

Provider: maas
Created: 2025-09-23