uznlp-uz/uz_syllables

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/uznlp-uz/uz_syllables
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: dataset_syllables.tsv
language:
- uz
license: cc-by-4.0
pretty_name: Uzbek Syllable Dataset for Linguistic and Natural Language Processing Research
size_categories:
- 10K<n<100K
tags:
- uzbek
- syllabification
- linguistics
- nlp
---

# Uzbek Syllable Dataset for Linguistic and Natural Language Processing Research

## Dataset Summary

`uz_syllables`, titled *Uzbek Syllable Dataset for Linguistic and Natural Language Processing Research*, is a word-level Uzbek syllabification dataset. Each row pairs an Uzbek word with its syllabified form, where syllable boundaries are marked with `-`.

The current TSV file contains:

- 21,607 rows
- 21,531 unique surface forms
- 1 split: `train`
- UTF-8 encoded tab-separated values (`.tsv`)

This dataset is useful for:

- Uzbek syllabification and segmentation tasks
- Rule-based or ML-based syllable splitter evaluation
- Educational tools for reading and spelling
- Lexicon preparation for downstream Uzbek NLP, TTS, or ASR pipelines

## Languages

- Uzbek (`uz`)

## Dataset Structure

### Data Instances

Example:

```json
{
  "ID": "1",
  "So‘z": "va’da",
  "Bo‘g‘inlarga ajratilgan shakli": "va’-da"
}
```

Another example:

```json
{
  "ID": "5",
  "So‘z": "e’tibor",
  "Bo‘g‘inlarga ajratilgan shakli": "e’ti-bor"
}
```

### Data Fields

- `ID`: Row identifier.
- `So‘z`: Original Uzbek word form.
- `Bo‘g‘inlarga ajratilgan shakli`: Syllabified version of the word. Syllable boundaries are marked with `-`.

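
The relationship between the two text fields can be sketched in Python. This is a minimal illustration, not part of the dataset tooling: the helper names `split_syllables` and `is_consistent` are hypothetical, and the check assumes the word itself contains no literal hyphen (a hyphenated compound would need separate handling).

```python
def split_syllables(syllabified: str, boundary: str = "-") -> list[str]:
    """Split a syllabified form such as "va’-da" into its syllables."""
    return [part for part in syllabified.split(boundary) if part]


def is_consistent(word: str, syllabified: str, boundary: str = "-") -> bool:
    """Return True if removing the boundary marker recovers the word.

    Assumption: the word contains no hyphen of its own.
    """
    return "".join(split_syllables(syllabified, boundary)) == word


# Row taken from the first example in the dataset card.
row = {
    "ID": "1",
    "So‘z": "va’da",
    "Bo‘g‘inlarga ajratilgan shakli": "va’-da",
}

syllables = split_syllables(row["Bo‘g‘inlarga ajratilgan shakli"])
print(syllables)  # ['va’', 'da']
print(is_consistent(row["So‘z"], row["Bo‘g‘inlarga ajratilgan shakli"]))  # True
```

A check like this could be run over every row to flag entries where the two columns disagree, for example because of differing apostrophe characters.
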
### Data Splits

| Split | Rows |
| --- | ---: |
| train | 21,607 |

## Loading the Dataset

From Hugging Face:

```python
from datasets import load_dataset

dataset = load_dataset("uznlp-uz/uz_syllables")
print(dataset["train"][0])
```

From a local TSV file:

```python
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={"train": "dataset_syllables.tsv"},
    delimiter="\t",
)
print(dataset["train"][0])
```

## Dataset Creation

The working source data in this project was maintained in spreadsheet form and exported to TSV for release. Local preprocessing scripts in the project indicate normalization focused on Uzbek apostrophe variants and related text cleanup before export.

This release is word-level only and does not include sentence context, phoneme labels, stress markers, or morphological tags.

## Recommended Uses

- Training and evaluation of Uzbek syllabification systems
- Benchmarking rule-based segmentation algorithms
- Building educational resources for Uzbek language learning
- Preprocessing support for pronunciation-aware applications

## Limitations

- The dataset is limited to isolated words, so it does not model sentence-level pronunciation or prosody.
- A small number of repeated surface forms are present in the TSV.
- Orthographic normalization choices, especially around apostrophes, may affect exact string matching in downstream systems.

## License

This dataset is released under the `CC-BY-4.0` license.
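
As a sketch of the apostrophe and repeated-form concerns noted under Limitations, the following Python snippet builds a lookup key that folds apostrophe-like characters together before comparing words. The character list, the `match_key` helper, and the sample words are assumptions for illustration only; the card does not state which variants the project's own preprocessing handles. Folding every variant to one character deliberately discards the distinction between the `‘` in `o‘`/`g‘` and the `’` glottal-stop mark, so the key is suitable for matching, not for display.

```python
import unicodedata
from collections import Counter

# Assumed set of apostrophe-like characters seen in Uzbek Latin text:
# U+2018, U+2019, U+02BB, U+02BC, backtick, acute accent.
APOSTROPHE_LIKE = "\u2018\u2019\u02bb\u02bc\u0060\u00b4"
_FOLD = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def match_key(word: str) -> str:
    """Return a comparison key with apostrophe variants folded to ASCII."""
    return unicodedata.normalize("NFC", word).translate(_FOLD).lower()


# Hypothetical input: the second item uses an ASCII apostrophe.
words = ["va’da", "va'da", "e’tibor"]

counts = Counter(match_key(w) for w in words)
repeated = {key: n for key, n in counts.items() if n > 1}
print(repeated)  # {"va'da": 2}
```

The same key can be applied to both the dataset column and incoming text so that lookups do not fail on apostrophe encoding alone.
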
Provider:
uznlp-uz