HuggingFaceFW/finetranslations-edu
Source: Hugging Face | Updated: 2026-01-09 | Indexed: 2026-02-07
Download link:
https://hf-mirror.com/datasets/HuggingFaceFW/finetranslations-edu
Description:
This dataset contains over 1 trillion tokens of parallel text pairing English with 500+ other languages. It was produced by translating data from the FineWeb2 dataset into English with the Gemma3 27B model. The primary motivation is to improve translation capabilities, especially for lower-resource languages. The Edu version keeps only the top 10% of content as scored by an educational classifier. The dataset is suitable for text generation and translation tasks and is released under the ODC-By license.
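To make the "top 10%" selection concrete, here is a minimal Python sketch of rank-based filtering by classifier score. It is an illustration only: the field name `edu_score`, the helper `top_fraction_by_score`, and the toy records are hypothetical, and the actual pipeline may instead apply a score threshold or select per language.

```python
import math


def top_fraction_by_score(records, fraction=0.10, score_key="edu_score"):
    """Return the highest-scoring `fraction` of records, best first.

    `score_key` is a hypothetical field name, not the dataset's real schema.
    At least one record is kept whenever the input is non-empty.
    """
    if not records:
        return []
    keep = max(1, math.floor(len(records) * fraction))
    ranked = sorted(records, key=lambda r: r[score_key], reverse=True)
    return ranked[:keep]


# Toy data: 20 made-up records whose score increases with the id.
docs = [{"id": i, "edu_score": i / 20} for i in range(20)]
selected = top_fraction_by_score(docs)
print([d["id"] for d in selected])
```

With 20 toy records and a fraction of 0.10, the helper keeps the two highest-scoring records.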
Provider:
HuggingFaceFW



