malhajar/finefrench
Source: Hugging Face. Updated: 2025-06-30. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/malhajar/finefrench
Description:
Fine-French is a heavily filtered French-language dataset derived from FineWeb-2 through careful cleaning. Using a combination of expert human annotation and automatic classification with a BERT model, it removes more than 75 million low-quality websites and provides approximately 12.5 billion high-quality tokens, with the aim of offering better training data for French language models.
Provided by:
malhajar



