omdeep22/Konkani_books_corpus-v2

Name: omdeep22/Konkani_books_corpus-v2
Creator: omdeep22
Published: 2026-01-28 05:40:41
License: No description provided

Hugging Face · Updated 2026-01-28 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/omdeep22/Konkani_books_corpus-v2


Resource description:

---
language:
- kok
- gom
license: mit
tags:
- konkani
- indian-languages
- indic-nlp
- low-resource-language
- monolingual
- text-corpus
- books
- nlp
- language-modeling
- text-generation
- llm
- llm-training
- pretraining
- fine-tuning
- devanagari
- romi
task_categories:
- text-generation
- language-modeling
pretty_name: Konkani Books Corpus v2
size_categories:
- 1B<n<10B
---

# Konkani Books Corpus v2

## Overview

**Konkani Books Corpus v2** is a large-scale monolingual text dataset for the **Konkani language** (`kok`, `gom`). It is primarily sourced from digitized books, literature, and long-form cultural texts.

This dataset is specifically curated for **LLM pretraining, fine-tuning, tokenizer training, and NLP research** in low-resource Indian languages.

- **Language**: Konkani (`kok`, `gom`)
- **Format**: Plain text (`.txt`)
- **Dataset Size**: 1.02 GB
- **Primary Source**: Books / Long-form literature
- **License**: MIT

---

## ⚠️ Important Note on Preprocessing

The raw dataset contains significant whitespace irregularities, including large gaps between words and excessive line breaks resulting from the digitization/OCR process.

> **Critical for LLM Training**: To ensure a stable **loss curve** and efficient tokenization, you **must normalize the whitespace** (densify the text) before training. Failure to remove these large gaps can lead to poor model convergence.

---

## How to Access the Dataset

### 1. Direct Download

Use this to download the entire corpus to your local machine:

```python
from datasets import load_dataset

# Download the full dataset
dataset = load_dataset("omdeep22/Konkani_books_corpus-v2", split="train")
print(dataset[0])
```
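The whitespace normalization called for in the preprocessing note could look roughly like the sketch below. It is a minimal illustration, not the dataset author's pipeline: the helper name `normalize_whitespace` and the specific rules (collapse runs of spaces and tabs to one space, keep at most one blank line between paragraphs) are assumptions chosen for this example.

```python
import re

# Hypothetical helper for illustration; the rules are assumptions,
# not a procedure documented by the dataset card.
_INLINE_WS = re.compile(r"[ \t\u00a0\u3000]+")  # spaces, tabs, NBSP, ideographic space
_BLANK_LINES = re.compile(r"\n{3,}")            # three or more consecutive newlines


def normalize_whitespace(text: str) -> str:
    """Collapse OCR-style gaps between words and excessive line breaks."""
    # Unify Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse every run of inline whitespace to a single space.
    text = _INLINE_WS.sub(" ", text)
    # Drop leading and trailing spaces on each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs.
    text = _BLANK_LINES.sub("\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "First    line \t here\n\n\n\n\nSecond   line  "
    print(repr(normalize_whitespace(sample)))
```

With the `datasets` library this can be applied per record through `Dataset.map`, for example `dataset.map(lambda ex: {"text": normalize_whitespace(ex["text"])})`. The column name `text` is an assumption here, so check `dataset.column_names` first.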

Provider:

omdeep22
