
bekan/karakalpak_corpus_v2_m

Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/bekan/karakalpak_corpus_v2_m
Description:
---
language:
- kaa
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
tags:
- karakalpak
- monolingual
- nlp
- corpus
pretty_name: Karakalpak Monolingual Text Corpus (Kaa-Corpus) v2.0
---

# Karakalpak Monolingual Text Corpus (Kaa-Corpus) v2.0

## Dataset Summary

This dataset is a large-scale **monolingual Karakalpak text corpus** created for training and evaluating **Natural Language Processing (NLP)** systems and **Large Language Models (LLMs)**.

The corpus contains millions of sentences collected from various high-quality written sources in **Standard Karakalpak (Latin script)**.

---

## Data Format

The dataset is distributed in **JSON Lines (`.jsonl`) format**.

Each line contains one text entry:

```json
{"text": "Til bar jerde ǵana insan jámiyeti jasaydı."}
```

Encoding: **UTF-8**

---

## Dataset Statistics

| Metric | Value |
|------|------|
| Sentences | ~135,667 |
| Words | ~2,200,000 (estimated) |
| Language | Karakalpak (Latin script) |

---

## Data Sources

The corpus was compiled from publicly available and written materials including:

- literary texts
- educational materials
- news and public domain documents

All texts were normalized and converted into sentence-level format.

---

## Preprocessing

The following preprocessing steps were applied:

- whitespace normalization
- removal of technical artifacts
- sentence-level formatting
- conversion to JSONL format

---

## Usage

Load the dataset with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("bekan/karakalpak_corpus_v2_m")

print(dataset["train"][0])
```
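To make the data format and the statistics table more concrete, here is a minimal sketch that reads JSON Lines entries and counts sentences and words. The function name `corpus_stats` is hypothetical, and counting words by splitting on whitespace is an assumption; the card does not say how its word estimate was produced. The sample reuses the example entry from the card instead of downloading the corpus.

```python
import io
import json


def corpus_stats(lines):
    """Return (entry_count, word_count) for an iterable of JSONL lines.

    Assumption: each non-blank line is a JSON object with a "text" field,
    and a "word" is any whitespace-separated token.
    """
    sentences = 0
    words = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        record = json.loads(line)
        sentences += 1
        words += len(record["text"].split())
    return sentences, words


# In-memory stand-in for a .jsonl file, built from the card's example entry.
sample = io.StringIO(
    '{"text": "Til bar jerde ǵana insan jámiyeti jasaydı."}\n'
    "\n"
)
print(corpus_stats(sample))
```

For a local copy of the corpus, the same function could be given a file object opened with `encoding="utf-8"`, matching the encoding stated in the card.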
Provider:
bekan