
Minuri/sinhala-corpus-a-news-1m

Hugging Face | Updated 2026-04-04 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-corpus-a-news-1m
Description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: News-Only Sinhala Corpus
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- news
- domain-classified
---

# News-Only Sinhala Corpus

A news-domain subset of 1M Sinhala sentences sampled from the `Minuri/diverse_sinhala_dataset` corpus, used for continual pretraining of LLaMA 3.2 1B (Model A) as part of a diversity-driven Sinhala language model adaptation study at the Informatics Institute of Technology (IIT), Colombo, affiliated with Robert Gordon University (RGU).

> **Corpus variants in this series:**
> - `Minuri/sinhala-corpus-a-news-1m` - News-only subset (domain-homogeneous baseline) - this repo
> - `Minuri/sinhala-corpus-b-random-1m` - Random subset (random baseline)
> - `Minuri/sinhala-corpus-c-diverse-1m` - Diversity-optimized subset ✅ Best perplexity

## Dataset Description

Corpus A serves as the **domain-homogeneous baseline**, comprising sentences drawn exclusively from the news domain of the parent corpus. This enables controlled comparison against the random (B) and diversity-optimized (C) corpora in downstream perplexity and evaluation experiments. The model trained on this corpus (Model A) achieved a perplexity of **14.68** on the Sinhala test set.
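The card reports perplexity but does not say how it was computed. As a minimal sketch, assuming the conventional definition (the exponential of the mean per-token negative log-likelihood, in natural log), the metric can be derived from token losses as shown below. The function name `perplexity` and the example loss values are hypothetical and are not taken from the study.

```python
import math
from typing import Iterable


def perplexity(token_nlls: Iterable[float]) -> float:
    """Return exp(mean negative log-likelihood per token).

    Assumes the losses are natural-log NLLs, one value per evaluated token.
    """
    nlls = list(token_nlls)
    if not nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(nlls) / len(nlls))


# Hypothetical per-token losses, for illustration only.
example_nlls = [2.5, 2.75, 2.75]
print(perplexity(example_nlls))
```

Under this assumed definition, a perplexity of 14.68 corresponds to a mean loss of ln(14.68), roughly 2.69 nats per token.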
### Source Datasets (via parent corpus)

| Source | Description |
|---|---|
| `culturax` | CulturaX multilingual web corpus (Sinhala subset) |
| `nsina` | NSina Sinhala news corpus |
| `madlad` | MADLAD-400 multilingual dataset (Sinhala subset) |
| `wikipedia` | Sinhala Wikipedia dump |

## Dataset Structure

| Column | Type | Description |
|---|---|---|
| `orig_index` | int | Original index in the parent corpus |
| `sentence` | string | Sinhala sentence text |
| `source` | string | Source dataset identifier |
| `predicted_domain` | string | Domain label predicted by XLM-RoBERTa classifier |
| `confidence` | float | Classifier confidence score |

### Splits

| Split | Rows |
|---|---|
| train | 1,000,000 |

### Format

Available in both JSONL and CSV formats.

## Intended Uses

- Continual pretraining of LLMs on Sinhala (domain-homogeneous baseline)
- Ablation studies on corpus diversity
- Sinhala NLP benchmarking

## Associated Model

This corpus was used to train: `Minuri/sinhala-llama-1b-corpus-news`

## Sources & Licenses

This dataset contains sentences derived from the following source datasets. Users must comply with the license terms of each:

| Source | License | Notes |
|---|---|---|
| [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) | ODC-BY | Attribution required |
| [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) | mC4 + OSCAR licenses | Requires contact info agreement on HuggingFace before access |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | CC BY-SA 3.0 + GFDL | ShareAlike - derived works must carry same license |
| [sinhala-nlp/NSINA](https://huggingface.co/datasets/sinhala-nlp/NSINA) | CC BY-SA 4.0 | ShareAlike - derived works must carry same license |

This dataset is released under **CC BY-SA 4.0** in compliance with the ShareAlike terms of Wikipedia and NSINA.
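The five-column schema described under Dataset Structure can be read from the JSONL release with the standard library alone. The sketch below is illustrative: the helper `iter_rows`, the `min_confidence` threshold, the placeholder sentences, and the `"news"` label string are all assumptions, since the card does not list the actual label values or file names. An in-memory sample stands in for the real file.

```python
import io
import json
from typing import Iterator, TextIO

# Column names as documented in the Dataset Structure table.
COLUMNS = ("orig_index", "sentence", "source", "predicted_domain", "confidence")


def iter_rows(fh: TextIO, min_confidence: float = 0.0) -> Iterator[dict]:
    """Yield JSONL rows that carry all documented columns and meet the
    (hypothetical) classifier-confidence threshold."""
    for line in fh:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        missing = [c for c in COLUMNS if c not in row]
        if missing:
            raise KeyError(f"row is missing columns: {missing}")
        if float(row["confidence"]) >= min_confidence:
            yield row


# Placeholder rows for illustration; not real corpus content.
SAMPLE = io.StringIO(
    '{"orig_index": 10, "sentence": "<sentence 1>", "source": "nsina", '
    '"predicted_domain": "news", "confidence": 0.98}\n'
    '{"orig_index": 42, "sentence": "<sentence 2>", "source": "culturax", '
    '"predicted_domain": "news", "confidence": 0.61}\n'
)

kept = list(iter_rows(SAMPLE, min_confidence=0.9))
print([r["orig_index"] for r in kept])  # [10]
```

To run this against the real data, replace `SAMPLE` with an open handle to a downloaded JSONL file. Streaming line by line avoids holding all 1,000,000 rows in memory.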
Provider:
Minuri