tartuNLP/fineweb-2-et

Name: tartuNLP/fineweb-2-et
Creator: tartuNLP
Published: 2026-01-09 21:51:31
License: ODC-By (per the dataset description)

Source: Hugging Face. Updated: 2026-01-09. Indexed: 2025-09-13.

Download link:

https://hf-mirror.com/datasets/tartuNLP/fineweb-2-et

Description:

This dataset contains Estonian-language text and is the Estonian subset of HuggingFaceFW/fineweb-2. Each record includes the text, an ID, the source URL, a date, a file path, the language, a language-identification score, the language script, the MinHash cluster size, and the top detected languages. It is intended for text generation tasks and is licensed under ODC-By. The dataset is split into a training set of about 4.6 billion bytes (about 9.63 million examples) and a test set of about 100 million bytes (about 24,000 examples).
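As a rough illustration of working with the splits and fields described above, the following Python sketch streams a split with the Hugging Face `datasets` library and filters records by language score. The split names (`train`, `test`), the column name `language_score`, and the helper names are assumptions made for this example and should be checked against the dataset card; the 0.9 threshold is arbitrary.

```python
from typing import Any, Dict, Iterable, Iterator

DATASET_ID = "tartuNLP/fineweb-2-et"


def stream_split(split: str = "train"):
    """Lazily stream one split from the Hugging Face Hub.

    Requires network access and the third-party `datasets` package.
    The split name is an assumption based on the listing's description.
    """
    from datasets import load_dataset  # imported here so the helpers below work offline

    return load_dataset(DATASET_ID, split=split, streaming=True)


def filter_by_language_score(
    records: Iterable[Dict[str, Any]], min_score: float = 0.9
) -> Iterator[Dict[str, Any]]:
    """Yield records whose (assumed) `language_score` column is at least min_score."""
    for record in records:
        if record.get("language_score", 0.0) >= min_score:
            yield record


def mean_bytes_per_example(total_bytes: float, num_examples: float) -> float:
    """Average stored size of one example, from the approximate listing figures."""
    return total_bytes / num_examples


if __name__ == "__main__":
    # Approximate figures quoted in the description: 4.6e9 bytes, 9.63e6 examples.
    print(f"train: ~{mean_bytes_per_example(4.6e9, 9.63e6):.0f} bytes per example")
    # Peek at a few high-confidence records without downloading the whole split.
    for i, rec in enumerate(filter_by_language_score(stream_split("test"))):
        print(rec["text"][:80])
        if i >= 2:
            break
```

Streaming avoids downloading the full training split up front, which matters at this size; dropping `streaming=True` would instead download and cache the data locally.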

Provider:

tartuNLP
