bekan/karakalpak_corpus_v2_m

Name: bekan/karakalpak_corpus_v2_m
Creator: bekan
Published: 2026-03-09 16:08:07
License: No description provided

Source: Hugging Face. Updated: 2026-03-09. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/bekan/karakalpak_corpus_v2_m

Description:

---
language:
- kaa
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
tags:
- karakalpak
- monolingual
- nlp
- corpus
pretty_name: Karakalpak Monolingual Text Corpus (Kaa-Corpus) v2.0
---

# Karakalpak Monolingual Text Corpus (Kaa-Corpus) v2.0

## Dataset Summary

This dataset is a large-scale **monolingual Karakalpak text corpus** created for training and evaluating **Natural Language Processing (NLP)** systems and **Large Language Models (LLMs)**.

The corpus contains millions of sentences collected from various high-quality written sources in **Standard Karakalpak (Latin script)**.

---

## Data Format

The dataset is distributed in **JSON Lines (`.jsonl`) format**.

Each line contains one text entry:

```json
{"text": "Til bar jerde ǵana insan jámiyeti jasaydı."}
```

Encoding: **UTF-8**

---

## Dataset Statistics

| Metric | Value |
|------|------|
| Sentences | ~135,667 |
| Words | ~2,200,000 (estimated) |
| Language | Karakalpak (Latin script) |

---

## Data Sources

The corpus was compiled from publicly available and written materials including:

- literary texts
- educational materials
- news and public domain documents

All texts were normalized and converted into sentence-level format.

---

## Preprocessing

The following preprocessing steps were applied:

- whitespace normalization
- removal of technical artifacts
- sentence-level formatting
- conversion to JSONL format

---

## Usage

Load the dataset with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("bekan/karakalpak_corpus_v2_m")

print(dataset["train"][0])
```
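The JSON Lines layout described under Data Format can also be read with the Python standard library alone. The following is a minimal sketch, not the authors' tooling: the helper name `corpus_stats` is hypothetical, and it assumes that every non-empty line is a JSON object with a `text` field and that words are approximated by splitting on whitespace. It shows one way to recompute sentence and word counts like those listed under Dataset Statistics.

```python
import io
import json


def corpus_stats(lines):
    """Return (entries, words) for an iterable of JSONL lines.

    Assumption: each non-empty line is a JSON object with a "text" field.
    Words are approximated by whitespace splitting.
    """
    entries = 0
    words = 0
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines, such as a trailing newline at end of file.
            continue
        record = json.loads(line)
        entries += 1
        words += len(record["text"].split())
    return entries, words


# In-memory sample built from the example entry shown under Data Format.
sample = io.StringIO(
    '{"text": "Til bar jerde ǵana insan jámiyeti jasaydı."}\n'
    "\n"
)
print(corpus_stats(sample))
```

For a local copy of the corpus, the same function accepts a file handle opened with `encoding="utf-8"`, since the card states that the files are UTF-8 encoded. Iterating line by line keeps memory use flat regardless of corpus size.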

Provider:

bekan
