bekan/karakalpak_corpus_v2_m
Hugging Face | Updated 2026-03-09 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/bekan/karakalpak_corpus_v2_m
Resource description:
---
language:
- kaa
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
tags:
- karakalpak
- monolingual
- nlp
- corpus
pretty_name: Karakalpak Monolingual Text Corpus (Kaa-Corpus) v2.0
---
# Karakalpak Monolingual Text Corpus (Kaa-Corpus) v2.0
## Dataset Summary
This dataset is a large-scale **monolingual Karakalpak text corpus** created for training and evaluating **Natural Language Processing (NLP)** systems and **Large Language Models (LLMs)**.
The corpus contains roughly 135,667 sentences (an estimated 2.2 million words) collected from various high-quality written sources in **Standard Karakalpak (Latin script)**.
---
## Data Format
The dataset is distributed in **JSON Lines (`.jsonl`) format**.
Each line contains one text entry:
```json
{"text": "Til bar jerde ǵana insan jámiyeti jasaydı."}
```
Encoding: **UTF-8**
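
For readers who want to parse the raw file without extra dependencies, here is a minimal sketch using only the Python standard library. The helper name `read_jsonl` is hypothetical, and the in-memory sample stands in for a real corpus file, whose name is not specified in this card.

```python
import io
import json

def read_jsonl(handle):
    """Yield the "text" field of each non-empty JSON line."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record["text"]

# In-memory stand-in for a corpus file. With a real file, pass
# open(path, encoding="utf-8") instead (the path is an assumption).
sample = io.StringIO('{"text": "Til bar jerde ǵana insan jámiyeti jasaydı."}\n')
texts = list(read_jsonl(sample))
print(texts[0])
```

Opening the file with an explicit `encoding="utf-8"` matters here because Karakalpak Latin text contains non-ASCII letters such as `ǵ`, `á`, and `ı`.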
---
## Dataset Statistics
| Metric | Value |
|------|------|
| Sentences | ~135,667 |
| Words | ~2,200,000 (estimated) |
| Language | Karakalpak (Latin script) |
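
The card does not say how the word count was estimated. One plausible approach, shown here purely as an assumption, is to count whitespace-separated tokens per entry. The function name `corpus_stats` is hypothetical, and the sample simply repeats the example sentence from the Data Format section.

```python
def corpus_stats(texts):
    """Count entries and whitespace-separated tokens in an iterable of strings."""
    sentences = 0
    words = 0
    for text in texts:
        sentences += 1
        # Assumption: a "word" is any run of non-whitespace characters.
        words += len(text.split())
    return {"sentences": sentences, "words": words}

example = "Til bar jerde ǵana insan jámiyeti jasaydı."
stats = corpus_stats([example, example])
print(stats)
```

A different tokenization (for example, stripping punctuation first) would give a somewhat different word total, which may be why the figure above is marked as estimated.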
---
## Data Sources
The corpus was compiled from publicly available written materials, including:
- literary texts
- educational materials
- news and public domain documents
All texts were normalized and converted into sentence-level format.
---
## Preprocessing
The following preprocessing steps were applied:
- whitespace normalization
- removal of technical artifacts
- sentence-level formatting
- conversion to JSONL format
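
The steps above can be sketched roughly as follows. This is not the authors' actual pipeline: the definition of "technical artifacts" (here, non-whitespace Unicode control and format characters), the naive punctuation-based sentence splitter, and all function names are assumptions made for illustration.

```python
import json
import re
import unicodedata

def strip_artifacts(text):
    # Assumption: "technical artifacts" means control/format characters
    # (Unicode categories starting with "C"), e.g. zero-width spaces.
    # Whitespace such as newlines is kept for the next step to handle.
    return "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )

def normalize_whitespace(text):
    # Collapse every run of whitespace into a single space and trim the ends.
    return " ".join(text.split())

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def to_jsonl(raw):
    cleaned = normalize_whitespace(strip_artifacts(raw))
    # ensure_ascii=False keeps letters such as "ǵ" readable in the output.
    return [
        json.dumps({"text": s}, ensure_ascii=False)
        for s in split_sentences(cleaned)
    ]

raw = (
    "Til bar jerde   ǵana insan\u200b jámiyeti\n jasaydı.  "
    "Til bar jerde ǵana insan jámiyeti jasaydı."
)
lines = to_jsonl(raw)
print("\n".join(lines))
```

A production pipeline would likely need a sentence splitter that handles abbreviations and quoted speech, which this sketch deliberately ignores.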
---
## Usage
Load the dataset with the Hugging Face `datasets` library:
```python
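from datasets import load_dataset

# Download the corpus from the Hugging Face Hub (or reuse the local cache).
dataset = load_dataset("bekan/karakalpak_corpus_v2_m")
# Print the first record of the "train" split, a dict with a "text" key.
print(dataset["train"][0])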
```
Provider:
bekan



