stukenov/sozkz-corpus-segmented-kk-v1

Name: stukenov/sozkz-corpus-segmented-kk-v1
Creator: stukenov
Published: 2026-04-29 18:12:10
License: not specified

Source: Hugging Face. Updated: 2026-04-29. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/stukenov/sozkz-corpus-segmented-kk-v1

Description:

The sozkz-corpus-segmented-kk-v1 dataset contains 55.5M Kazakh texts in which morpheme boundaries have been marked by a BiLSTM neural segmenter. It is built for training morpheme-aware tokenizers. Boundaries are marked with the ASCII Unit Separator character (0x1F). The data is sourced from stukenov/ekitil-corpus-annotated-kk-v1 and filtered to keep only texts whose detected language is Kazakh with confidence ≥ 0.95. The segmentation model is trained on the QazCorpora dataset and classifies every character with a BIO tag (B-ROOT, I-ROOT, B-SUFFIX, I-SUFFIX); a B-SUFFIX tag marks the beginning of a new morpheme. Intended uses are training morpheme-aware BPE tokenizers, morphological analysis, and linguistic research on Kazakh.
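The boundary format described above can be illustrated with a minimal Python sketch. It assumes only what the description states: the separator is the character 0x1F, and a B-SUFFIX tag opens a new morpheme. The helper names (`split_morphemes`, `strip_boundaries`, `tags_to_segmented`) and the sample word are hypothetical and not taken from the dataset or its tooling; the dataset's actual column names are not given here, so loading is left out.

```python
# Illustrative sketch only: helper names and the sample word are assumptions,
# not part of the dataset or the author's pipeline.

SEP = "\x1f"  # ASCII Unit Separator (0x1F), the boundary marker named in the description


def split_morphemes(word: str) -> list[str]:
    """Split one segmented word into its morphemes, dropping empty pieces."""
    return [piece for piece in word.split(SEP) if piece]


def strip_boundaries(text: str) -> str:
    """Recover plain text by deleting every boundary marker."""
    return text.replace(SEP, "")


def tags_to_segmented(word: str, tags: list[str]) -> str:
    """Insert SEP before each character tagged B-SUFFIX (except at position 0).

    Assumes one tag per character, as the description says each character
    is classified.
    """
    if len(word) != len(tags):
        raise ValueError("expected exactly one tag per character")
    out = []
    for i, (ch, tag) in enumerate(zip(word, tags)):
        if tag == "B-SUFFIX" and i > 0:
            out.append(SEP)
        out.append(ch)
    return "".join(out)


# Hand-made example: "балалар" ("children") = root "бала" + plural suffix "лар".
word = "балалар"
tags = ["B-ROOT", "I-ROOT", "I-ROOT", "I-ROOT", "B-SUFFIX", "I-SUFFIX", "I-SUFFIX"]

segmented = tags_to_segmented(word, tags)
morphemes = split_morphemes(segmented)
```

Because 0x1F is a non-printing control character, it is unlikely to collide with characters in ordinary text, which is presumably why it was chosen; a tokenizer pre-processing step could split on it before BPE merges are learned so that merges do not cross morpheme boundaries.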

Provider:

stukenov
