Chamaka8/serendip-cpt-sinhala
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Chamaka8/serendip-cpt-sinhala
---
language:
- si
license: cc-by-4.0
task_categories:
- text-generation
pretty_name: Serendib LLM CPT Sinhala Corpus
size_categories:
- 10M<n<100M
tags:
- sinhala
- pretraining
- continual-pretraining
- low-resource
- nlp
- llama
- language-model
configs:
- config_name: default
  data_files:
  - split: train
    path: data/raw/train_shards_clean/*.jsonl
  - split: validation
    path: data/raw/val.jsonl
---
# Serendib LLM CPT Sinhala Corpus
A large-scale, deduplicated, quality-filtered Sinhala plain-text corpus built for
Continual Pre-Training (CPT) of large language models. This dataset was used to adapt
**Meta-LLaMA-3-8B** to the Sinhala language domain as part of the
[Serendib LLM](https://huggingface.co/Chamaka8) Honours Degree Research Project
at the University of Central Lancashire (UCLan), 2025–2026.
This is one of the largest openly published Sinhala NLP corpora available, containing
**23,449,223 training documents** across 11 cleaned shards (about 5.17 GB of text). The full repository, including the raw shards, occupies **21.6 GB** of storage.
---
## Dataset Statistics
### Splits
| Split | File(s) | Entries | Notes |
|---|---|---|---|
| Train (clean) | `train_shards_clean/train_000–010.jsonl` | **23,449,223** | Primary training data |
| Validation | `val.jsonl` | 23,150 | Full validation set |
| Val Small | `val_small.jsonl` | 2,000 | Quick validation subset |
| Train Small | `train_small.jsonl` | 20,000 | Quick training subset |
| **Grand Total** | | **23,494,373** | |
### Per-Shard Breakdown (Clean)
| Shard | Entries | Text Size | Avg Doc Length |
|---|---|---|---|
| train_000.jsonl | 2,410,010 | 549.0 MB | 238 chars |
| train_001.jsonl | 2,402,042 | 548.6 MB | 239 chars |
| train_002.jsonl | 2,401,723 | 549.0 MB | 239 chars |
| train_003.jsonl | 2,118,331 | 535.9 MB | 265 chars |
| train_004.jsonl | 348,806 | 425.8 MB | 1,279 chars |
| train_005.jsonl | 2,821,090 | 511.7 MB | 190 chars |
| train_006.jsonl | 2,805,252 | 511.0 MB | 191 chars |
| train_007.jsonl | 2,744,722 | 508.5 MB | 194 chars |
| train_008.jsonl | 2,668,897 | 505.2 MB | 198 chars |
| train_009.jsonl | 2,585,353 | 501.6 MB | 203 chars |
| train_010.jsonl | 142,997 | 28.1 MB | 206 chars |
| **Total** | **23,449,223** | **~5.17 GB text** | **~234 chars avg** |
> Note: train_004 has a much higher average document length (1,279 chars) than the other shards (roughly 190–265 chars), which suggests it holds long-form documents such as Wikipedia articles and government publications.
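
The per-shard figures above can be recomputed from the JSONL files. The following is a minimal sketch, assuming one JSON object per line with a `text` field. The card does not state how "Text Size" and "Avg Doc Length" were measured, so the hypothetical helper `shard_stats` reports both UTF-8 bytes and characters.

```python
import json
from pathlib import Path


def shard_stats(path):
    """Count entries and measure the `text` field of one JSONL shard.

    Assumption: every non-empty line is a JSON object with a `text` string.
    This helper is illustrative and is not part of the repository scripts.
    """
    entries = total_chars = total_bytes = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            text = json.loads(line).get("text", "")
            entries += 1
            total_chars += len(text)                  # Unicode code points
            total_bytes += len(text.encode("utf-8"))  # on-disk text size
    return {
        "entries": entries,
        "text_bytes": total_bytes,
        "avg_chars": total_chars / entries if entries else 0.0,
    }
```

Running `shard_stats` on a locally downloaded copy of `train_004.jsonl` and on one other shard is a quick way to check the difference in document length described in the note.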
### Repository Storage
Total LFS storage: **21.6 GB** (raw + clean shards, scripts, validation sets)
---
## Schema
Each document is stored as a JSON object. Fields vary slightly across shards due to source diversity:
| Field | Type | Present In | Description |
|---|---|---|---|
| `id` | string | All shards | Unique document identifier |
| `text` | string | All shards | Sinhala plain text content |
| `source` | string | All shards | Origin website or corpus source |
| `label` | string | train_000, val | Document category label |
| `lang` | string | train_001–003 | Language tag (si) |
| `src` | string | train_001–003 | Source URL or reference |
### Sample Entry
```json
{
  "id": "si_001234",
  "text": "ශ්‍රී ලංකාව දිවයිනක් වන අතර එය ඉන්දියාවේ දකුණු කෙළවරට නුදුරින් පිහිටා ඇත...",
  "source": "sinhala.wikipedia.org"
}
```
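
Because the optional fields differ between shards, downstream code may prefer records with a fixed set of keys. The sketch below assumes only the field names listed in the schema table. `normalise_record` is a hypothetical helper and not part of `load_data.py`.

```python
import json

# Field names come from the schema table; the required/optional split
# follows its "Present In" column.
REQUIRED = ("id", "text", "source")
OPTIONAL = ("label", "lang", "src")


def normalise_record(line):
    """Parse one JSONL line into a dict with a fixed set of keys.

    Missing optional fields become None; a missing required field raises
    ValueError so that malformed lines are not silently accepted.
    """
    raw = json.loads(line)
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"missing required field(s): {missing}")
    record = {key: raw[key] for key in REQUIRED}
    record.update({key: raw.get(key) for key in OPTIONAL})
    return record
```

Applied to the sample entry above, the result keeps `id`, `text` and `source` and sets `label`, `lang` and `src` to `None`.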
---
## Data Sources
The corpus was assembled from six distinct Sinhala web source categories to maximise
linguistic diversity and domain coverage:
| Source Type | Examples | Content Style |
|---|---|---|
| Sinhala Wikipedia | sinhala.wikipedia.org | Encyclopaedic, formal prose |
| News portals | AdaDerana, NewsFirst, Lankadeepa | News articles, formal |
| Government publications | Official Sinhala documents | Formal, institutional |
| Online forums | Sinhala discussion boards | Informal, conversational |
| Educational materials | Sinhala-medium academic content | Educational, structured |
| Literary blogs | Sinhala poetry and essay sites | Creative, literary |
---
## Cleaning Pipeline
A 7-stage cleaning pipeline was applied to the raw shards to produce `train_shards_clean/`:
| Step | Method | Threshold / Detail |
|---|---|---|
| 1. HTML removal | BeautifulSoup `get_text()` + regex | Strips nav links, ads, JS snippets |
| 2. Encoding filter | Repeated character detection | Entries with >5 consecutive identical chars removed |
| 3. Numeric filter | Numeric content ratio check | >15% numeric content discarded |
| 4. Exact deduplication | SHA-256 hash of normalised string | Identical documents removed |
| 5. Near-dedup | 5-gram Jaccard similarity | Threshold 0.85 — near-identical republications removed |
| 6. Length filter | Token count check | <20 or >2,048 tokens discarded |
| 7. Language filter | Unicode range U+0D80–U+0DFF | <40% Sinhala characters discarded |
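
Several of the stages above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the contents of `rebuild_clean_shards.py`. It assumes that ratios are computed over non-whitespace characters, that normalisation means Unicode NFC plus whitespace collapsing, and that the 5-grams are character 5-grams. The token-length filter is omitted because the table does not name the tokenizer used for counting.

```python
import hashlib
import re
import unicodedata

# Stage 2: the same character more than 5 times in a row (6 or more).
REPEAT_RE = re.compile(r"(.)\1{5,}")


def has_long_repeat(text):
    return REPEAT_RE.search(text) is not None


def numeric_ratio(text):
    """Stage 3: share of digits among non-whitespace characters (assumed)."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0


def sinhala_ratio(text):
    """Stage 7: share of characters in the Sinhala block U+0D80 to U+0DFF."""
    chars = [c for c in text if not c.isspace()]
    hits = sum("\u0d80" <= c <= "\u0dff" for c in chars)
    return hits / len(chars) if chars else 0.0


def content_hash(text):
    """Stage 4: SHA-256 of the text after an assumed normalisation."""
    normalised = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def char_ngrams(text, n=5):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=5):
    """Stage 5: Jaccard similarity of character n-gram sets (assumed)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    union = x | y
    if not union:  # both texts shorter than n characters
        return float(a == b)
    return len(x & y) / len(union)


def keep(text):
    """Apply the per-document thresholds from stages 2, 3 and 7."""
    return (
        not has_long_repeat(text)
        and numeric_ratio(text) <= 0.15
        and sinhala_ratio(text) >= 0.40
    )
```

Exact deduplication then amounts to keeping the first document seen for each `content_hash` value. Comparing every pair of documents with `jaccard` is not practical for tens of millions of entries; an approximate scheme such as MinHash with locality-sensitive hashing is a common substitute, although the table does not say which approach was used here.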
---
## Repository Structure
```
serendip-cpt-sinhala/
├── data/
│   └── raw/
│       ├── train_shards/            # Raw scraped shards (000–010)
│       ├── train_shards_clean/      # Cleaned & deduplicated shards (000–010)
│       ├── train_small.jsonl        # 20,000 entry quick-train subset
│       ├── val.jsonl                # 23,150 entry validation set
│       └── val_small.jsonl          # 2,000 entry quick-val subset
├── train_scripts/
│   ├── train_cpt.py                 # Continual pre-training script
│   ├── train_sft_lora.py            # SFT LoRA fine-tuning script
│   ├── train_dpo.py                 # DPO alignment script
│   ├── merge_lora.py                # LoRA adapter merge script
│   ├── load_data.py                 # Dataset loading utilities
│   ├── rebuild_clean_shards.py      # Cleaning pipeline script
│   ├── sample_preview.py            # Data preview utility
│   ├── sanity_check_dataset.py      # Dataset validation script
│   └── requirements.txt             # Python dependencies
├── runpod/
│   ├── run_sft_lora.sh              # RunPod training launch script
│   ├── setup_env.sh                 # RunPod environment setup
│   └── .env.example                 # Environment variable template
├── test_base.py                     # Base model test script
├── test_cpt.py                      # CPT model test script
├── mini_run.py                      # Quick sanity check run
└── compare_small_vs_cpt.py          # Base vs CPT comparison script
```
---
## Usage
### Load with Hugging Face Datasets
```python
from datasets import load_dataset
# Load full training set
ds = load_dataset("Chamaka8/serendip-cpt-sinhala", split="train")
print(f"Training entries: {len(ds):,}")
print(ds[0])
# Load validation set
val = load_dataset("Chamaka8/serendip-cpt-sinhala", split="validation")
print(f"Validation entries: {len(val):,}")
```
### Load a Single Shard
```python
from datasets import load_dataset
shard = load_dataset(
    "json",
    data_files="hf://datasets/Chamaka8/serendip-cpt-sinhala/data/raw/train_shards_clean/train_000.jsonl",
)
print(shard)
```
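
To load several shards at once, the same `hf://` path pattern can be generated for any subset of shard numbers. The helper below is a small sketch. `clean_shard_urls` is a hypothetical name, and the path layout is taken from the example above and the repository structure.

```python
REPO = "Chamaka8/serendip-cpt-sinhala"
SHARD_DIR = "data/raw/train_shards_clean"


def clean_shard_urls(indices):
    """Build hf:// URLs for the given cleaned shard numbers (0 to 10)."""
    urls = []
    for i in indices:
        if not 0 <= i <= 10:
            raise ValueError(f"shard index out of range: {i}")
        urls.append(f"hf://datasets/{REPO}/{SHARD_DIR}/train_{i:03d}.jsonl")
    return urls
```

The resulting list can be passed as `data_files` to `load_dataset("json", ...)`. For example, `clean_shard_urls([4])` selects only the long-form shard `train_004.jsonl`.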
### Use for Continual Pre-Training
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Chamaka8/serendib-tokenizer")

# Stream the corpus so the full training set is not downloaded up front
ds = load_dataset("Chamaka8/serendip-cpt-sinhala", split="train", streaming=True)

for example in ds:
    tokens = tokenizer(example["text"], truncation=True, max_length=2048)
    # Feed to your CPT training loop
    break
```
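
Most documents in this corpus are far shorter than 2,048 tokens, so a training loop may want to pack several documents into each fixed-length sequence. The sketch below shows one common way to do that. It is not taken from `train_cpt.py`. `pack_blocks`, its parameters, and the choice to drop the final partial block are all assumptions made for illustration.

```python
def pack_blocks(token_streams, block_size=2048, eos_id=None):
    """Concatenate tokenised documents and yield fixed-length blocks.

    token_streams: iterable of lists of token ids, one list per document.
    eos_id: optional separator id appended after each document.
    Leftover tokens that do not fill a final block are dropped.
    """
    buffer = []
    for ids in token_streams:
        buffer.extend(ids)
        if eos_id is not None:
            buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
```

With the streaming dataset above, the input could be a generator such as `(tokenizer(ex["text"])["input_ids"] for ex in ds)`, with `eos_id=tokenizer.eos_token_id` marking document boundaries.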
---
## Trained Models
This corpus was used to train the following models:
| Model | Description | Link |
|---|---|---|
| serendib-llm-cpt-llama3-8b | CPT base model trained on this corpus | [🤗 Hub](https://huggingface.co/Chamaka8/serendib-llm-cpt-llama3-8b) |
| Serendip-LLM-CPT-SFT-v2 | SFT model built on CPT base | [🤗 Hub](https://huggingface.co/Chamaka8/Serendip-LLM-CPT-SFT-v2) |
---
## Related Datasets
| Dataset | Description | Link |
|---|---|---|
| Serendip-sft-sinhala | 603,000 Sinhala instruction-response pairs | [🤗 Hub](https://huggingface.co/datasets/Chamaka8/Serendip-sft-sinhala) |
| SerendibLLM-PoemSong-Dataset | 19,184 Sinhala poem and song generation entries | [🤗 Hub](https://huggingface.co/datasets/Chamaka8/SerendibLLM-PoemSong-Dataset) |
---
## Hardware Used for Training
- **GPU**: NVIDIA RTX A4500 (20 GB VRAM) and A100 (40 GB VRAM) on RunPod
- **Framework**: PyTorch + Hugging Face Transformers + PEFT
- **Quantisation**: 4-bit NF4 (QLoRA) during fine-tuning
---
## Limitations
- Corpus is weighted toward formal, urban Sinhala (news, Wikipedia, government sources)
- Dialectal, informal, and rural Sinhala registers are underrepresented
- Some shards have inconsistent metadata fields due to multi-source scraping
---
## Licence
This dataset is released under **Creative Commons Attribution 4.0 (CC BY 4.0)**.
You are free to use, share, and adapt this data for any purpose with attribution.
---
## Citation
```bibtex
@misc{amarasinghe2026serendib,
  title  = {Serendib LLM: A Sinhala Large Language Model for NLP Benchmarks and Poem Generation},
  author = {Amarasinghe, Chamaka},
  year   = {2026},
  school = {University of Central Lancashire},
  url    = {https://huggingface.co/Chamaka8}
}
```
---
## Contact
**Chamaka Amarasinghe**
University of Central Lancashire — CO3008 Honours Degree Project 2025–2026
GitHub: [chamakarochana](https://github.com/chamakarochana)
LinkedIn: [chamaka-amarasinghe-b54904211](https://www.linkedin.com/in/chamaka-amarasinghe-b54904211)
Provided by: Chamaka8



