
omdeep22/Konkani_books_corpus-v2

Source: Hugging Face · Updated: 2026-01-28 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/omdeep22/Konkani_books_corpus-v2
Resource description:
---
language:
- kok
- gom
license: mit
tags:
- konkani
- indian-languages
- indic-nlp
- low-resource-language
- monolingual
- text-corpus
- books
- nlp
- language-modeling
- text-generation
- llm
- llm-training
- pretraining
- fine-tuning
- devanagari
- romi
task_categories:
- text-generation
- language-modeling
pretty_name: Konkani Books Corpus v2
size_categories:
- 1B<n<10B
---

# Konkani Books Corpus v2

## Overview

**Konkani Books Corpus v2** is a large-scale monolingual text dataset for the **Konkani language** (`kok`, `gom`). It is primarily sourced from digitized books, literature, and long-form cultural texts. This dataset is specifically curated for **LLM pretraining, fine-tuning, tokenizer training, and NLP research** in low-resource Indian languages.

- **Language**: Konkani (`kok`, `gom`)
- **Format**: Plain text (`.txt`)
- **Dataset Size**: 1.02 GB
- **Primary Source**: Books / Long-form literature
- **License**: MIT

---

## ⚠️ Important Note on Preprocessing

The raw dataset contains significant whitespace irregularities, including large gaps between words and excessive line breaks resulting from the digitization/OCR process.

> **Critical for LLM Training**: To ensure a stable **loss curve** and efficient tokenization, you **must normalize the whitespace** (densify the text) before training. Failure to remove these large gaps can lead to poor model convergence.

---

## How to Access the Dataset

### 1. Direct Download

Use this to download the entire corpus to your local machine:

```python
from datasets import load_dataset

# Download the full dataset
dataset = load_dataset("omdeep22/Konkani_books_corpus-v2", split="train")
print(dataset[0])
```
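
The whitespace normalization called for in the preprocessing note could be sketched roughly as follows. This is a minimal illustration, not the dataset author's own pipeline: the function name `normalize_whitespace` is hypothetical, and keeping one blank line between paragraphs (rather than flattening everything onto a single line) is an assumption you may want to change for your tokenizer.

```python
import re

# Runs of spaces, tabs, or non-breaking spaces inside a line (the "large gaps").
_INLINE_GAPS = re.compile(r"[ \t\u00a0]+")
# Three or more consecutive newlines (the "excessive line breaks").
_BLANK_RUNS = re.compile(r"\n{3,}")


def normalize_whitespace(text: str) -> str:
    """Collapse OCR-style gaps while keeping paragraph boundaries.

    Hypothetical helper: the dataset card does not prescribe an exact procedure.
    """
    # Unify Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Squeeze gaps within each line and trim leading/trailing whitespace.
    lines = [_INLINE_GAPS.sub(" ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    text = _BLANK_RUNS.sub("\n\n", text)
    return text.strip()
```

If the loaded dataset exposes its content in a column named `text` (an assumption; check `dataset.column_names` first), the helper could be applied with `dataset.map(lambda ex: {"text": normalize_whitespace(ex["text"])})`.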
Provider:
omdeep22