
Gulnur7/kazakh-lexical-complexity-classes

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Gulnur7/kazakh-lexical-complexity-classes
Resource description:
---
language:
- kk
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- multi-class-classification
tags:
- cefr
- lexical-complexity
- kazakh
- turkic
- morphology
- language-learning
size_categories:
- 1K<n<10K
pretty_name: Kazakh Lexical Complexity Classes
dataset_info:
  features:
  - name: lemma
    dtype: string
  - name: pos
    dtype: string
  - name: cefr
    dtype: string
  splits:
  - name: full
    num_examples: 4561
configs:
- config_name: default
  data_files:
  - split: full
    path: kazakh_cefr_lexicon.json
---

# Kazakh Lexical Complexity Classes

A CEFR-graded lexical resource for the Kazakh language. The lexicon contains **4,561** lemma–POS entries graded across five CEFR proficiency levels.

## Data Format

The dataset is provided as a single JSON file. Each entry has the following fields:

| Field   | Type   | Description                                                   |
|---------|--------|---------------------------------------------------------------|
| `lemma` | string | Kazakh word (Cyrillic script)                                 |
| `pos`   | string | Part of speech (NOUN, VERB, ADJ, ADV, NUM, PRON, OTHER, etc.) |
| `cefr`  | string | CEFR proficiency level (A1, A2, B1, B2, C1)                   |

### Example entries

```json
[
  {"lemma": "су", "pos": "NOUN", "cefr": "A1"},
  {"lemma": "бару", "pos": "VERB", "cefr": "A1"},
  {"lemma": "байланыс", "pos": "NOUN", "cefr": "B1"},
  {"lemma": "жаһандану", "pos": "NOUN", "cefr": "C1"}
]
```

## Distribution

### By CEFR Level

| Level     | Count     |
|-----------|----------:|
| A1        | 962       |
| A2        | 697       |
| B1        | 890       |
| B2        | 891       |
| C1        | 1,121     |
| **Total** | **4,561** |

### By Part of Speech

| POS   | Count |
|-------|------:|
| NOUN  | 1,984 |
| VERB  | 811   |
| OTHER | 759   |
| ADJ   | 495   |
| ADV   | 193   |
| NUM   | 91    |
| PRON  | 76    |
| ADP   | 74    |
| MODAL | 39    |
| INTJ  | 22    |
| AUX   | 17    |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Gulnur7/kazakh-lexical-complexity-classes")
```

## Citation

Released later

## License

This dataset is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
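As a supplement to the Usage snippet in the card, the following is a minimal offline sketch of how entries in this format might be tallied by CEFR level and grouped by part of speech. It uses only the four example entries quoted in the card. The helper names `count_by_level` and `lemmas_by_level` are hypothetical and not part of the dataset or of any library. It also assumes the full file `kazakh_cefr_lexicon.json` is a JSON array of objects with the same three fields.

```python
import json
from collections import Counter, defaultdict

# Sample copied from the card's "Example entries" block. Assumption: the real
# file (kazakh_cefr_lexicon.json) is a JSON array with the same shape.
SAMPLE_JSON = """
[
  {"lemma": "су", "pos": "NOUN", "cefr": "A1"},
  {"lemma": "бару", "pos": "VERB", "cefr": "A1"},
  {"lemma": "байланыс", "pos": "NOUN", "cefr": "B1"},
  {"lemma": "жаһандану", "pos": "NOUN", "cefr": "C1"}
]
"""

# The five levels listed in the card's field description.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1"]


def count_by_level(entries):
    """Return entry counts per CEFR level, in level order (hypothetical helper)."""
    counts = Counter(entry["cefr"] for entry in entries)
    return {level: counts.get(level, 0) for level in CEFR_ORDER}


def lemmas_by_level(entries, pos=None):
    """Group lemmas by CEFR level, optionally keeping one POS tag only."""
    grouped = defaultdict(list)
    for entry in entries:
        if pos is None or entry["pos"] == pos:
            grouped[entry["cefr"]].append(entry["lemma"])
    return dict(grouped)


entries = json.loads(SAMPLE_JSON)
print(count_by_level(entries))
print(lemmas_by_level(entries, pos="NOUN"))
```

If the dataset is loaded with `load_dataset` instead, the front matter indicates a single split named `full`, so iterating over `dataset["full"]` should yield dictionaries that can be passed to the same helpers.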
Provider:
Gulnur7