omdeep22/Konkani_books_corpus-v2
Source: Hugging Face (updated 2026-01-28, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/omdeep22/Konkani_books_corpus-v2
Dataset description:
---
language:
- kok
- gom
license: mit
tags:
- konkani
- indian-languages
- indic-nlp
- low-resource-language
- monolingual
- text-corpus
- books
- nlp
- language-modeling
- text-generation
- llm
- llm-training
- pretraining
- fine-tuning
- devanagari
- romi
task_categories:
- text-generation
- language-modeling
pretty_name: Konkani Books Corpus v2
size_categories:
- 1B<n<10B
---
# Konkani Books Corpus v2
## Overview
**Konkani Books Corpus v2** is a large-scale monolingual text dataset for the **Konkani language** (`kok`, `gom`). It is primarily sourced from digitized books, literature, and long-form cultural texts. This dataset is specifically curated for **LLM pretraining, fine-tuning, tokenizer training, and NLP research** in low-resource Indian languages.
- **Language**: Konkani (`kok`, `gom`)
- **Format**: Plain text (`.txt`)
- **Dataset Size**: 1.02 GB
- **Primary Source**: Books / Long-form literature
- **License**: MIT
---
## ⚠️ Important Note on Preprocessing
The raw dataset contains significant whitespace irregularities, including large gaps between words and excessive line breaks resulting from the digitization/OCR process.
> **Critical for LLM Training**: To ensure a stable **loss curve** and efficient tokenization, you **must normalize the whitespace** (densify the text) before training. Failure to remove these large gaps can lead to poor model convergence.
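
A minimal sketch of the whitespace normalization described above. It assumes that collapsing runs of spaces and tabs to a single space, and capping consecutive line breaks at one blank line, is acceptable for your pipeline. The function name `normalize_whitespace` is hypothetical and is not part of the dataset or of any library; adjust the rules (for example, whether to keep paragraph breaks at all) to suit your tokenizer.

```python
import re

# Runs of any whitespace except newline (spaces, tabs, non-breaking spaces, ...)
_INLINE_WS = re.compile(r"[^\S\n]+")
# Three or more consecutive newlines, i.e. more than one blank line
_MANY_NEWLINES = re.compile(r"\n{3,}")


def normalize_whitespace(text: str) -> str:
    """Densify OCR-style text: single spaces within lines, at most one blank line."""
    # Unify line endings first so the newline rules below apply consistently
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse large gaps between words into a single space
    text = _INLINE_WS.sub(" ", text)
    # Drop leading and trailing spaces on every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep paragraph breaks, but never more than one blank line in a row
    text = _MANY_NEWLINES.sub("\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "कोंकणी      भास \n\n\n\n   गोंय\t\tराज्य  "
    print(repr(normalize_whitespace(sample)))
```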
---
## How to Access the Dataset
### 1. Direct Download
Use this to download the entire corpus to your local machine:
```python
from datasets import load_dataset

# Download the full dataset
dataset = load_dataset("omdeep22/Konkani_books_corpus-v2", split="train")

# Inspect the first record
print(dataset[0])
```
Provider: omdeep22



