Gulnur7/kazakh-lexical-complexity-classes
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Gulnur7/kazakh-lexical-complexity-classes
Resource description:
---
language:
- kk
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- multi-class-classification
tags:
- cefr
- lexical-complexity
- kazakh
- turkic
- morphology
- language-learning
size_categories:
- 1K<n<10K
pretty_name: Kazakh Lexical Complexity Classes
dataset_info:
  features:
  - name: lemma
    dtype: string
  - name: pos
    dtype: string
  - name: cefr
    dtype: string
  splits:
  - name: full
    num_examples: 4561
configs:
- config_name: default
  data_files:
  - split: full
    path: kazakh_cefr_lexicon.json
---
# Kazakh Lexical Complexity Classes
A CEFR-graded lexical resource for the Kazakh language. The lexicon contains **4,561** lemma–POS entries graded across five CEFR proficiency levels.
## Data Format
The dataset is provided as a single JSON file. Each entry has the following fields:
| Field | Type | Description |
|---------|--------|--------------------------------------------------|
| `lemma` | string | Kazakh word (Cyrillic script) |
| `pos` | string | Part of speech (NOUN, VERB, OTHER, ADJ, ADV, NUM, PRON, ADP, MODAL, INTJ, AUX) |
| `cefr` | string | CEFR proficiency level (A1, A2, B1, B2, C1) |
### Example entries
```json
[
{"lemma": "су", "pos": "NOUN", "cefr": "A1"},
{"lemma": "бару", "pos": "VERB", "cefr": "A1"},
{"lemma": "байланыс", "pos": "NOUN", "cefr": "B1"},
{"lemma": "жаһандану", "pos": "NOUN", "cefr": "C1"}
]
```
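As a minimal sketch of how entries in this format could be consumed, the snippet below parses the four example entries shown above, checks that each one carries the three documented fields, and builds a lookup keyed by `(lemma, pos)`. The names `REQUIRED_FIELDS` and `cefr_by_entry` are illustrative and not part of the dataset; keying by `(lemma, pos)` assumes each lemma–POS pair occurs only once.

```python
import json

# The four example entries from the dataset card, inlined so the sketch
# runs without downloading anything.
raw = """
[
  {"lemma": "су", "pos": "NOUN", "cefr": "A1"},
  {"lemma": "бару", "pos": "VERB", "cefr": "A1"},
  {"lemma": "байланыс", "pos": "NOUN", "cefr": "B1"},
  {"lemma": "жаһандану", "pos": "NOUN", "cefr": "C1"}
]
"""

entries = json.loads(raw)

# Hypothetical validation step: every entry should have exactly the
# fields listed in the Data Format table.
REQUIRED_FIELDS = {"lemma", "pos", "cefr"}
for entry in entries:
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        raise ValueError(f"entry {entry!r} is missing fields: {sorted(missing)}")

# Lookup from (lemma, pos) to CEFR level (assumes pairs are unique).
cefr_by_entry = {(e["lemma"], e["pos"]): e["cefr"] for e in entries}

print(cefr_by_entry[("су", "NOUN")])
```

The same loop should work unchanged on the full `kazakh_cefr_lexicon.json` file once it is read with `json.load`, assuming the file is a single top-level JSON array as the example suggests.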
## Distribution
### By CEFR Level
| Level | Count |
|-------|------:|
| A1 | 962 |
| A2 | 697 |
| B1 | 890 |
| B2 | 891 |
| C1 | 1,121 |
| **Total** | **4,561** |
### By Part of Speech
| POS | Count |
|-------|------:|
| NOUN | 1,984 |
| VERB | 811 |
| OTHER | 759 |
| ADJ | 495 |
| ADV | 193 |
| NUM | 91 |
| PRON | 76 |
| ADP | 74 |
| MODAL | 39 |
| INTJ | 22 |
| AUX | 17 |
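The two tables can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the tables above, confirms that both breakdowns sum to the stated total of 4,561, and prints each CEFR level's share of the lexicon. The variable names are illustrative only.

```python
# Counts copied from the "By CEFR Level" table.
by_level = {"A1": 962, "A2": 697, "B1": 890, "B2": 891, "C1": 1121}

# Counts copied from the "By Part of Speech" table.
by_pos = {
    "NOUN": 1984, "VERB": 811, "OTHER": 759, "ADJ": 495, "ADV": 193,
    "NUM": 91, "PRON": 76, "ADP": 74, "MODAL": 39, "INTJ": 22, "AUX": 17,
}

total = 4561  # number of entries stated in the dataset card

# Both breakdowns should account for every entry exactly once.
level_sum = sum(by_level.values())
pos_sum = sum(by_pos.values())
print(level_sum == total, pos_sum == total)

# Share of the lexicon at each CEFR level.
for level, count in by_level.items():
    print(f"{level}: {count / total:.1%}")
```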
## Usage
```python
from datasets import load_dataset
# The card metadata defines a single split named "full".
dataset = load_dataset("Gulnur7/kazakh-lexical-complexity-classes")
print(dataset["full"][0])
```
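Once loaded, a common next step is selecting vocabulary up to a given proficiency level. The sketch below is one possible way to do that: it treats the five CEFR labels as an ordered scale and keeps entries at or below a chosen level. `CEFR_ORDER`, `RANK`, and `at_or_below` are hypothetical helpers, not part of the dataset or the `datasets` library, and the sample rows are the example entries from the card rather than a real download.

```python
# Ordered CEFR scale used by this dataset (A1 is easiest, C1 is hardest).
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1"]
RANK = {level: i for i, level in enumerate(CEFR_ORDER)}


def at_or_below(entries, max_level):
    """Return the entries whose CEFR level is max_level or easier.

    `entries` can be any iterable of dicts with a "cefr" key.
    """
    limit = RANK[max_level]
    return [e for e in entries if RANK[e["cefr"]] <= limit]


# Stand-in rows taken from the card's example entries.
sample = [
    {"lemma": "су", "pos": "NOUN", "cefr": "A1"},
    {"lemma": "бару", "pos": "VERB", "cefr": "A1"},
    {"lemma": "байланыс", "pos": "NOUN", "cefr": "B1"},
    {"lemma": "жаһандану", "pos": "NOUN", "cefr": "C1"},
]

beginner = at_or_below(sample, "A2")
print([e["lemma"] for e in beginner])
```

Because the helper only assumes dict-like rows, it should also accept `dataset["full"]` from the loading snippet above, though that has not been verified against the published file. The integer ranks in `RANK` can double as class labels for the multi-class classification task named in the metadata.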
## Citation
Citation information will be released later.
## License
This dataset is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
Provided by:
Gulnur7



