hotchpotch/fineweb-2-edu-japanese-noise-detect-raw
Source: Hugging Face. Updated: 2025-02-20. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/hotchpotch/fineweb-2-edu-japanese-noise-detect-raw
Description:
This dataset contains Japanese text normalized with Unicode NFKC and is intended for educational purposes. The fineweb-2-japanese-text-cleaner was used to infer noise in the original data, and the noisy text was removed. The data is split into training and test sets, and each record provides the text content, noise spans, and a unique identifier. It is suited to Japanese text processing tasks.
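
To illustrate the two operations the description mentions, here is a minimal sketch of NFKC normalization followed by span-based noise removal. It uses only the Python standard library. The helper names `normalize_nfkc` and `remove_spans` are hypothetical, and representing noise spans as half-open `(start, end)` character offsets is an assumption; the dataset's actual schema may differ.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (folds full-width and other compatibility forms)."""
    return unicodedata.normalize("NFKC", text)


def remove_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Drop every [start, end) character range listed in `spans` from `text`.

    Assumption: spans are half-open character offsets into `text`.
    Overlapping or unsorted spans are tolerated.
    """
    kept = []
    cursor = 0
    for start, end in sorted(spans):
        if start > cursor:
            kept.append(text[cursor:start])  # keep the clean text before this span
        cursor = max(cursor, end)            # skip past the noisy range
    kept.append(text[cursor:])               # keep whatever follows the last span
    return "".join(kept)


if __name__ == "__main__":
    raw = "本文です。広告ＡＢＣ"          # made-up sample, not taken from the dataset
    normalized = normalize_nfkc(raw)     # full-width letters become ASCII
    print(remove_spans(normalized, [(5, 10)]))
```

Normalizing before applying offsets matters in this sketch because NFKC can change string length, so spans computed on normalized text would not line up with the raw text.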
Provider:
hotchpotch



