
Chamaka8/serendip-cpt-sinhala

Hugging Face | Updated 2026-03-24 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Chamaka8/serendip-cpt-sinhala
Description:
---
language:
- si
license: cc-by-4.0
task_categories:
- text-generation
pretty_name: Serendib LLM CPT Sinhala Corpus
size_categories:
- 10M<n<100M
tags:
- sinhala
- pretraining
- continual-pretraining
- low-resource
- nlp
- llama
- language-model
configs:
- config_name: default
  data_files:
  - split: train
    path: data/raw/train_shards_clean/*.jsonl
  - split: validation
    path: data/raw/val.jsonl
---

# Serendib LLM CPT Sinhala Corpus

A large-scale, deduplicated, quality-filtered Sinhala plain-text corpus built for Continual Pre-Training (CPT) of large language models. This dataset was used to adapt **Meta-LLaMA-3-8B** to the Sinhala language domain as part of the [Serendib LLM](https://huggingface.co/Chamaka8) Honours Degree Research Project at the University of Central Lancashire (UCLan), 2025–2026.

This is one of the largest openly published Sinhala NLP corpora available, containing **23,449,223 training documents** across 11 cleaned shards totalling **21.6 GB** of storage.

---

## Dataset Statistics

### Splits

| Split | File(s) | Entries | Notes |
|---|---|---|---|
| Train (clean) | `train_shards_clean/train_000–010.jsonl` | **23,449,223** | Primary training data |
| Validation | `val.jsonl` | 23,150 | Full validation set |
| Val Small | `val_small.jsonl` | 2,000 | Quick validation subset |
| Train Small | `train_small.jsonl` | 20,000 | Quick training subset |
| **Grand Total** | | **23,494,373** | |

### Per-Shard Breakdown (Clean)

| Shard | Entries | Text Size | Avg Doc Length |
|---|---|---|---|
| train_000.jsonl | 2,410,010 | 549.0 MB | 238 chars |
| train_001.jsonl | 2,402,042 | 548.6 MB | 239 chars |
| train_002.jsonl | 2,401,723 | 549.0 MB | 239 chars |
| train_003.jsonl | 2,118,331 | 535.9 MB | 265 chars |
| train_004.jsonl | 348,806 | 425.8 MB | 1,279 chars |
| train_005.jsonl | 2,821,090 | 511.7 MB | 190 chars |
| train_006.jsonl | 2,805,252 | 511.0 MB | 191 chars |
| train_007.jsonl | 2,744,722 | 508.5 MB | 194 chars |
| train_008.jsonl | 2,668,897 | 505.2 MB | 198 chars |
| train_009.jsonl | 2,585,353 | 501.6 MB | 203 chars |
| train_010.jsonl | 142,997 | 28.1 MB | 206 chars |
| **Total** | **23,449,223** | **~5.17 GB text** | **~234 chars avg** |

> Note: train_004 has a significantly higher average document length (1,279 chars) compared to other shards (~200 chars), indicating it contains long-form documents such as Wikipedia articles and government publications.

### Repository Storage

Total LFS storage: **21.6 GB** (raw + clean shards, scripts, validation sets)

---

## Schema

Each document is stored as a JSON object. Fields vary slightly across shards due to source diversity:

| Field | Type | Present In | Description |
|---|---|---|---|
| `id` | string | All shards | Unique document identifier |
| `text` | string | All shards | Sinhala plain text content |
| `source` | string | All shards | Origin website or corpus source |
| `label` | string | train_000, val | Document category label |
| `lang` | string | train_001–003 | Language tag (si) |
| `src` | string | train_001–003 | Source URL or reference |

### Sample Entry

```json
{
  "id": "si_001234",
  "text": "ශ්‍රී ලංකාව දිවයිනක් වන අතර එය ඉන්දියාවේ දකුණු කෙළවරට නුදුරින් පිහිටා ඇත...",
  "source": "sinhala.wikipedia.org"
}
```

---

## Data Sources

The corpus was assembled from six distinct Sinhala web source categories to maximise linguistic diversity and domain coverage:

| Source Type | Examples | Content Style |
|---|---|---|
| Sinhala Wikipedia | sinhala.wikipedia.org | Encyclopaedic, formal prose |
| News portals | AdaDerana, NewsFirst, Lankadeepa | News articles, formal |
| Government publications | Official Sinhala documents | Formal, institutional |
| Online forums | Sinhala discussion boards | Informal, conversational |
| Educational materials | Sinhala-medium academic content | Educational, structured |
| Literary blogs | Sinhala poetry and essay sites | Creative, literary |

---

## Cleaning Pipeline

A 7-stage cleaning pipeline was applied to the raw shards to produce `train_shards_clean/`:

| Step | Method | Threshold / Detail |
|---|---|---|
| 1. HTML removal | BeautifulSoup `get_text()` + regex | Strips nav links, ads, JS snippets |
| 2. Encoding filter | Repeated character detection | Entries with >5 consecutive identical chars removed |
| 3. Numeric filter | Numeric content ratio check | >15% numeric content discarded |
| 4. Exact deduplication | SHA-256 hash of normalised string | Identical documents removed |
| 5. Near-dedup | 5-gram Jaccard similarity | Threshold 0.85; near-identical republications removed |
| 6. Length filter | Token count check | <20 or >2,048 tokens discarded |
| 7. Language filter | Unicode range U+0D80–U+0DFF | <40% Sinhala characters discarded |

---

## Repository Structure

```
serendip-cpt-sinhala/
├── data/
│   └── raw/
│       ├── train_shards/            # Raw scraped shards (000–010)
│       ├── train_shards_clean/      # Cleaned & deduplicated shards (000–010)
│       ├── train_small.jsonl        # 20,000 entry quick-train subset
│       ├── val.jsonl                # 23,150 entry validation set
│       └── val_small.jsonl          # 2,000 entry quick-val subset
├── train_scripts/
│   ├── train_cpt.py                 # Continual pre-training script
│   ├── train_sft_lora.py            # SFT LoRA fine-tuning script
│   ├── train_dpo.py                 # DPO alignment script
│   ├── merge_lora.py                # LoRA adapter merge script
│   ├── load_data.py                 # Dataset loading utilities
│   ├── rebuild_clean_shards.py      # Cleaning pipeline script
│   ├── sample_preview.py            # Data preview utility
│   ├── sanity_check_dataset.py      # Dataset validation script
│   └── requirements.txt             # Python dependencies
├── runpod/
│   ├── run_sft_lora.sh              # RunPod training launch script
│   ├── setup_env.sh                 # RunPod environment setup
│   └── .env.example                 # Environment variable template
├── test_base.py                     # Base model test script
├── test_cpt.py                      # CPT model test script
├── mini_run.py                      # Quick sanity check run
└── compare_small_vs_cpt.py          # Base vs CPT comparison script
```

---

## Usage

### Load with Hugging Face Datasets

```python
from datasets import load_dataset

# Load full training set
ds = load_dataset("Chamaka8/serendip-cpt-sinhala", split="train")
print(f"Training entries: {len(ds):,}")
print(ds[0])

# Load validation set
val = load_dataset("Chamaka8/serendip-cpt-sinhala", split="validation")
print(f"Validation entries: {len(val):,}")
```

### Load a Single Shard

```python
from datasets import load_dataset

shard = load_dataset(
    "json",
    data_files="hf://datasets/Chamaka8/serendip-cpt-sinhala/data/raw/train_shards_clean/train_000.jsonl"
)
print(shard)
```

### Use for Continual Pre-Training

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Chamaka8/serendib-tokenizer")
ds = load_dataset("Chamaka8/serendip-cpt-sinhala", split="train", streaming=True)

for example in ds:
    tokens = tokenizer(example["text"], truncation=True, max_length=2048)
    # Feed to your CPT training loop
    break
```

---

## Trained Models

This corpus was used to train the following models:

| Model | Description | Link |
|---|---|---|
| serendib-llm-cpt-llama3-8b | CPT base model trained on this corpus | [🤗 Hub](https://huggingface.co/Chamaka8/serendib-llm-cpt-llama3-8b) |
| Serendip-LLM-CPT-SFT-v2 | SFT model built on CPT base | [🤗 Hub](https://huggingface.co/Chamaka8/Serendip-LLM-CPT-SFT-v2) |

---

## Related Datasets

| Dataset | Description | Link |
|---|---|---|
| Serendip-sft-sinhala | 603,000 Sinhala instruction-response pairs | [🤗 Hub](https://huggingface.co/datasets/Chamaka8/Serendip-sft-sinhala) |
| SerendibLLM-PoemSong-Dataset | 19,184 Sinhala poem and song generation entries | [🤗 Hub](https://huggingface.co/datasets/Chamaka8/SerendibLLM-PoemSong-Dataset) |

---

## Hardware Used for Training

- **GPU**: NVIDIA RTX A4500 (20 GB VRAM) and A100 (40 GB VRAM) on RunPod
- **Framework**: PyTorch + HuggingFace Transformers + PEFT
- **Quantisation**: 4-bit NF4 (QLoRA) during fine-tuning

---

## Limitations

- Corpus is weighted toward formal, urban Sinhala (news, Wikipedia, government sources)
- Dialectal, informal, and rural Sinhala registers are underrepresented
- Some shards have inconsistent metadata fields due to multi-source scraping

---

## Licence

This dataset is released under **Creative Commons Attribution 4.0 (CC BY 4.0)**. You are free to use, share, and adapt this data for any purpose with attribution.

---

## Citation

```bibtex
@misc{amarasinghe2026serendib,
  title  = {Serendib LLM: A Sinhala Large Language Model for NLP Benchmarks and Poem Generation},
  author = {Amarasinghe, Chamaka},
  year   = {2026},
  school = {University of Central Lancashire},
  url    = {https://huggingface.co/Chamaka8}
}
```

---

## Contact

**Chamaka Amarasinghe**

University of Central Lancashire, CO3008 Honours Degree Project 2025–2026

GitHub: [chamakarochana](https://github.com/chamakarochana)

LinkedIn: [chamaka-amarasinghe-b54904211](https://www.linkedin.com/in/chamaka-amarasinghe-b54904211)
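
---

## Appendix: Cleaning Filter Sketch

The thresholds listed under Cleaning Pipeline can be illustrated with a minimal sketch of steps 2, 3, 4, 5 and 7. This is not the repository's `rebuild_clean_shards.py`; every function name here is hypothetical. It also makes assumptions the card does not state: ratios are computed over non-whitespace characters, normalisation means collapsing whitespace, the 5-grams are character-level, and a similarity equal to the threshold counts as a near-duplicate. Step 1 (HTML removal) and step 6 (token-count filter) are omitted because they depend on the scraped markup and on a tokenizer.

```python
import hashlib
import re

# "More than 5 consecutive identical chars" is read as a run of 6 or more.
REPEAT_RUN = re.compile(r"(.)\1{5,}")


def _visible(text):
    """Characters counted toward ratios (whitespace ignored; an assumption)."""
    return [c for c in text if not c.isspace()]


def sinhala_ratio(text):
    """Share of visible characters inside the Sinhala block U+0D80..U+0DFF."""
    chars = _visible(text)
    if not chars:
        return 0.0
    return sum("\u0d80" <= c <= "\u0dff" for c in chars) / len(chars)


def numeric_ratio(text):
    """Share of visible characters that are digits."""
    chars = _visible(text)
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def keep_document(text):
    """Apply steps 2, 3 and 7 to one document; True means it survives."""
    if REPEAT_RUN.search(text):          # step 2: repeated-character runs
        return False
    if numeric_ratio(text) > 0.15:       # step 3: more than 15% numeric
        return False
    if sinhala_ratio(text) < 0.40:       # step 7: under 40% Sinhala
        return False
    return True


def content_hash(text):
    """Step 4: SHA-256 of the whitespace-normalised text."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def char_ngrams(text, n=5):
    """Set of character n-grams; empty when the text is shorter than n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=5):
    """Step 5: Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def deduplicate(docs, threshold=0.85):
    """Drop exact duplicates by hash, then near-duplicates by Jaccard."""
    seen_hashes = set()
    kept = []
    for doc in docs:
        digest = content_hash(doc)
        if digest in seen_hashes:
            continue
        # Pairwise comparison is quadratic; fine for a demo only.
        if any(jaccard(doc, other) >= threshold for other in kept):
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```

The pairwise near-duplicate check would not scale to a corpus of this size; an approximate method such as MinHash with locality-sensitive hashing is the usual substitute, though the card does not say which approach was used.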
Provided by:
Chamaka8