FineWeb-2 refined pretraining datasets

Name: FineWeb-2 refined pretraining datasets
Creator: School of Computer and Communication Sciences, EPFL (École Polytechnique Fédérale de Lausanne)
Published: 2025-02-15 02:42:07
License: No description available

Source: arXiv. Updated 2025-02-15; indexed 2025-02-27.

Download link:

https://huggingface.co/epfml

Overview:

This study presents a model-based data selection framework for improving the pretraining of multilingual large language models (LLMs). The framework spans diverse language families, scripts, and levels of resource availability, and focuses on identifying structured and knowledge-rich samples. Comprehensive ablation studies on the FineWeb-2 web crawl dataset, using FastText-based classifiers and multi-layer perceptron (MLP) classifiers over Transformer embeddings, demonstrate the effectiveness of the approach. Finally, the authors release refined pretraining datasets covering 20 languages to advance multilingual language modeling.

Provider:

School of Computer and Communication Sciences, EPFL (École Polytechnique Fédérale de Lausanne)

Created:

2025-02-15

Collected summary

Dataset introduction

Construction

The FineWeb-2 refined pretraining datasets were created with a model-based filtering framework that identifies structured and knowledge-rich samples in multilingual data. The framework relies on Transformer- and FastText-based classifiers, which keeps it broadly accessible. It was evaluated on the FineWeb-2 web crawl dataset across diverse language families, scripts, and levels of resource availability. With a 1B-parameter Llama model, the method matches the baseline MMLU score using as little as 15% of the training tokens, while also improving on other benchmarks.
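The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_top_fraction`, `keep_fraction`, and `toy_quality_score` are hypothetical names, and the toy scorer is only a stand-in for a trained FastText- or Transformer-based classifier. The retained fraction is an assumed parameter, since the listing does not state the threshold used.

```python
from typing import Callable, List, Sequence


def select_top_fraction(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Return the highest-scoring documents, preserving their original order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    # Score every document once with the supplied quality scorer.
    scores = [score_fn(doc) for doc in docs]
    # Number of documents to keep (always at least one).
    n_keep = max(1, int(len(docs) * keep_fraction))
    # Rank indices by descending score; ties are broken by original position.
    ranked = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
    # Restore corpus order among the kept documents.
    kept = sorted(ranked[:n_keep])
    return [docs[i] for i in kept]


def toy_quality_score(doc: str) -> float:
    """Hypothetical stand-in for a trained classifier: share of long words."""
    words = doc.split()
    if not words:
        return 0.0
    return sum(len(w) > 6 for w in words) / len(words)


if __name__ == "__main__":
    corpus = [
        "click here to win now",
        "Photosynthesis converts sunlight into chemical energy within chloroplasts.",
        "lol ok",
        "The treaty established permanent diplomatic relations between neighbouring countries.",
    ]
    for doc in select_top_fraction(corpus, toy_quality_score, keep_fraction=0.5):
        print(doc)
```

In a real pipeline, `score_fn` would wrap the classifier's predicted probability that a document is structured and knowledge-rich, and the ranking would typically be done per language.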

Features

The FineWeb-2 refined pretraining datasets are characterized by their focus on structured and knowledge-rich data samples, which are identified using a model-based filtering framework. This method is transparent, simple, and efficient, and it can be applied across diverse language families and scripts. The datasets are designed to enhance the performance of large language models (LLMs) by providing high-quality pretraining data.

Usage

To use the FineWeb-2 refined pretraining datasets, researchers and developers can download the datasets from the provided links. These datasets can then be used to pretrain LLMs, which can be fine-tuned for various applications. The datasets are compatible with popular deep learning frameworks and can be easily integrated into existing LLM training pipelines. Researchers can also use the provided codebase to implement the model-based filtering framework on their own datasets.
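As a rough sketch of feeding downloaded data into a training pipeline, the helper below streams document text from local shards. It assumes the shards have been exported to JSON Lines (optionally gzip-compressed) with a `text` field per record; the listing does not state the distribution format, so the file layout, the field name, and the `iter_texts` helper are all hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_texts(shard_dir: str, text_field: str = "text") -> Iterator[str]:
    """Yield the text of each record in the .jsonl / .jsonl.gz shards of a directory."""
    for path in sorted(Path(shard_dir).iterdir()):
        if path.name.endswith(".jsonl.gz"):
            opener = gzip.open
        elif path.name.endswith(".jsonl"):
            opener = open
        else:
            continue  # skip anything that is not a JSON Lines shard
        with opener(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                text = record.get(text_field)
                if text:
                    yield text
```

Because the function is a generator, documents are read lazily and can be passed straight to a tokenizer without loading a whole shard into memory.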

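Background and challenges

Background overview

As large language models (LLMs) have improved markedly, dataset curation has become foundational to their performance. However, model-based filtering techniques have so far focused mainly on English, and comparatively little work addresses model-based filtering for non-English languages. To address this imbalance, the researchers propose a model-based filtering framework that aims to identify a diverse set of structured and knowledge-rich samples in multilingual datasets. The framework uses Transformer- and FastText-based classifiers to keep the technique broadly accessible. Comprehensive ablation studies on the FineWeb-2 web crawl dataset demonstrate the effectiveness of the method. When training a 1B-parameter Llama model, the method matches the baseline MMLU score using only 15% of the training tokens, while also improving on other benchmarks. These findings provide strong evidence that the method generalizes to other languages. The framework was therefore extended to 20 languages, and the corresponding refined pretraining datasets were released.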

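Current challenges

The challenges facing the dataset include: 1) the domain problem it addresses, namely how to effectively identify diverse and knowledge-rich samples in multilingual data in order to improve multilingual pretraining datasets; and 2) challenges encountered during construction, including how to balance data quality against data scale, how to choose suitable training data for the classifiers, and how to avoid data contamination.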
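The contamination concern mentioned above is often handled with an n-gram overlap check against benchmark text. The sketch below shows that general technique only; it is not confirmed to be the procedure used for these datasets, and the function names and the choice of `n` are illustrative assumptions.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    # Texts shorter than n tokens produce an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_items: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[NGram] = set()
    for item in benchmark_items:
        index |= word_ngrams(item, n)
    return index


def drop_contaminated(
    docs: Iterable[str], benchmark_index: Set[NGram], n: int
) -> List[str]:
    """Keep only documents that share no n-gram with the benchmark index."""
    return [d for d in docs if word_ngrams(d, n).isdisjoint(benchmark_index)]


if __name__ == "__main__":
    index = build_benchmark_index(["What is the capital of France"], n=4)
    docs = [
        "quiz: what is the capital of france answer paris",
        "the weather in paris is mild today",
    ]
    print(drop_contaminated(docs, index, n=4))
```

Whitespace tokenization is a simplification that does not suit languages written without spaces; a multilingual pipeline would need a language-appropriate tokenizer or character-level n-grams.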

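Common scenarios

Typical use cases

The dataset is used mainly to improve the pretraining of multilingual large language models (LLMs). Through a model-based filtering framework, the FineWeb-2 refined pretraining datasets identify structured and knowledge-rich samples across diverse language families, scripts, and levels of resource availability. The approach emphasizes transparency, simplicity, and efficiency, and its use of Transformer- and FastText-based classifiers supports both broad accessibility and data quality.

Academic problems addressed

The FineWeb-2 refined pretraining datasets address the problem of data quality filtering in multilingual LLM pretraining. Rule-based filtering methods have traditionally targeted English datasets, and model-based filtering techniques have likewise concentrated on English data. The work behind these datasets proposes a model-based filtering framework to close this gap for non-English languages: a model selects high-quality, structured, and knowledge-rich data, which improves LLM performance. The work also validates experimentally that the method generalizes across languages, and it provides the filtered pretraining datasets to advance multilingual language modeling.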

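Related work

The FineWeb-2 refined pretraining datasets are closely related to several well-known efforts. For example, the FineWeb-Edu dataset uses model-based filtering with LLM-based quality assessment to improve LLM performance, and the DCLM dataset uses a FastText classifier for data filtering, achieving efficient and competitive results. The FineWeb-2 refined pretraining datasets in turn support research on multilingual language model pretraining and provide an important data foundation for building high-quality multilingual language models.

The content above was collected and summarized by 遇见数据集.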