Geonwoohong/aihub-webcorpus-morph-tokenized-ko
Source: Hugging Face | Updated: 2025-10-21 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/Geonwoohong/aihub-webcorpus-morph-tokenized-ko
Description:
This dataset is a morphologically analyzed Korean web corpus derived from the AIHub Korean Web Corpus. Each record contains morpheme-level tokens split into two subsets: a semantic subset holding content-bearing morphemes, and a stylistic subset holding grammatical and stylistic ones. The source text was cleaned by removing HTML tags, URLs, email addresses, mentions, and hashtags, then analyzed with the Kiwi morphological analyzer to produce the two subsets. The processed data is stored as Apache Arrow shards, which supports efficient streaming and loading.
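The cleaning and splitting steps described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's pipeline: the regular expressions, the helper names `clean_text` and `split_morphemes`, and the choice of which part-of-speech tag prefixes count as "semantic" are all assumptions. The input to `split_morphemes` is a hand-written list of (form, tag) pairs in the Sejong-style tag scheme that Kiwi uses, so the sketch runs without installing an analyzer.

```python
import re

# Assumed cleaning patterns; the listing does not publish the exact rules.
# Order matters: tags go first so URLs do not swallow a trailing "</p>",
# and emails go before mentions so "@example.com" is not treated as a mention.
_PATTERNS = [
    re.compile(r"<[^>]+>"),                       # HTML tags
    re.compile(r"https?://\S+|www\.\S+"),         # URLs
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),  # email addresses
    re.compile(r"(?<!\w)[@#]\w+"),                # mentions and hashtags
]


def clean_text(text: str) -> str:
    """Remove markup and social-media residue, then collapse whitespace."""
    for pattern in _PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


# Assumed split criterion: nouns, verbs, adjectives, general adverbs,
# foreign words, and numbers are treated as content-bearing.
CONTENT_PREFIXES = ("NN", "VV", "VA", "MAG", "SL", "SN")


def split_morphemes(tagged):
    """Split (form, tag) pairs into semantic and stylistic morpheme lists."""
    semantic, stylistic = [], []
    for form, tag in tagged:
        target = semantic if tag.startswith(CONTENT_PREFIXES) else stylistic
        target.append(form)
    return semantic, stylistic


if __name__ == "__main__":
    raw = "<p>안녕하세요 @user #태그 https://example.com 문의: a.b@example.com</p>"
    print(clean_text(raw))

    # Hand-tagged morphemes for "학교에 갔다" (illustrative, not analyzer output).
    tagged = [("학교", "NNG"), ("에", "JKB"), ("가", "VV"), ("었", "EP"), ("다", "EF")]
    print(split_morphemes(tagged))
```

In a real pipeline the (form, tag) pairs would come from a morphological analyzer such as Kiwi rather than being written by hand, and the boundary between the two subsets may be drawn differently from the prefix list assumed here.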
Provider:
Geonwoohong



