stukenov/sozkz-corpus-segmented-kk-v1

Source: Hugging Face. Updated 2026-04-29. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/stukenov/sozkz-corpus-segmented-kk-v1
Description:
The sozkz-corpus-segmented-kk-v1 dataset contains 55.5M Kazakh texts in which morpheme boundaries have been marked by a BiLSTM neural segmenter; it is built for training morpheme-aware tokenizers. Boundaries are marked with the ASCII Unit Separator character (0x1F). The data is drawn from stukenov/ekitil-corpus-annotated-kk-v1 and filtered to keep only texts whose detected language is Kazakh with confidence ≥ 0.95. The segmentation model is trained on the QazCorpora dataset and classifies each character with a BIO tag (B-ROOT, I-ROOT, B-SUFFIX, I-SUFFIX); a B-SUFFIX tag marks the start of a new morpheme. Intended uses are training morpheme-aware BPE tokenizers, morphological analysis of Kazakh, and linguistic research.
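
To make the boundary convention concrete, here is a minimal Python sketch of how per-character BIO tags could be turned into a string segmented with the 0x1F marker, and how such a string could be split or cleaned again. The function names are hypothetical and are not part of the dataset's own tooling, the tags in the example are written by hand rather than produced by the BiLSTM segmenter, and the sketch assumes the marker sits inline between the characters of a word.

```python
# Sketch only: function names are hypothetical, not the dataset's tooling.
SEP = "\x1f"  # ASCII Unit Separator, the morpheme boundary marker


def insert_boundaries(word, tags):
    """Place SEP before every character tagged B-SUFFIX (one tag per character)."""
    if len(word) != len(tags):
        raise ValueError("expected exactly one tag per character")
    out = []
    for i, (ch, tag) in enumerate(zip(word, tags)):
        # B-SUFFIX opens a new morpheme, so a boundary goes in front of it.
        if tag == "B-SUFFIX" and i > 0:
            out.append(SEP)
        out.append(ch)
    return "".join(out)


def split_morphemes(segmented_word):
    """Split a segmented word into its morphemes."""
    return segmented_word.split(SEP)


def strip_boundaries(text):
    """Recover plain text by deleting every boundary marker."""
    return text.replace(SEP, "")


# Hand-tagged illustration: "балалар" (children) = root "бала" + plural "лар".
word = "балалар"
tags = ["B-ROOT", "I-ROOT", "I-ROOT", "I-ROOT",
        "B-SUFFIX", "I-SUFFIX", "I-SUFFIX"]
segmented = insert_boundaries(word, tags)
print(split_morphemes(segmented))          # ['бала', 'лар']
print(strip_boundaries(segmented) == word)  # True
```

Because 0x1F is a non-printing control character, it is unlikely to collide with ordinary Kazakh text, which is presumably why it was chosen; a tokenizer trainer can pre-split on it, while `strip_boundaries` restores the unsegmented text.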
Provider:
stukenov