sh4lu-z/Sinhala-Mega-Corpus-v1

Name: sh4lu-z/Sinhala-Mega-Corpus-v1
Creator: sh4lu-z
Published: 2026-03-07 14:25:59
License: CC BY-SA 4.0

Source: Hugging Face. Updated 2026-03-07; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/sh4lu-z/Sinhala-Mega-Corpus-v1

Resource overview:

---
language:
- si
license: cc-by-sa-4.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- fill-mask
tags:
- nlp
- sinhala
- text-corpus
- llm
- pre-training
- wikipedia
- mc4
- cc100
pretty_name: Sinhala Mega Corpus v1
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 323065
---

# Sinhala Mega Corpus v1

## Description

Sinhala Mega Corpus v1 is a large merged dataset designed for training Sinhala Large Language Models (LLMs) and tokenizers. It combines several major open-source datasets into a single, unified format, covering a diverse range of linguistic patterns from web crawls, encyclopedic knowledge, and conversational data. The single `train` split contains 323,065 examples. It is intended as a resource for Sinhala AI research.

---

## Data Sources

This dataset consists of samples from the following repositories:

1. **mC4 (Sinhala subset):** Large-scale web-crawled data, distributed by AllenAI.
2. **Sinhala Wikipedia:** High-quality encyclopedic content (2023 version).
3. **CC100 Sinhala:** Cleaned web data used for Meta's XLM-R.
4. **Awesome Dataset Sinhala:** Curated dataset by sh4lu-z.
5. **Common Voice (Sinhala Transcripts):** Spoken language patterns by Mozilla.

---

## Dataset Structure

Each record in the dataset contains the following fields:

* **text:** The actual Sinhala text content.
* **source:** The origin of the data (e.g., wikipedia, mc4, etc.).

---

## Licensing

This dataset is released under the **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)** license. This license is mandatory because the corpus includes data from Wikipedia and other CC BY-SA sources. If you use this dataset, you must credit the original authors and share any derivative work under the same license.

---

## How to use

You can load this dataset using the Hugging Face `datasets` library:

```python
import datasets

# Downloads the corpus from the Hugging Face Hub on first use.
ds = datasets.load_dataset("sh4lu-z/Sinhala-Mega-Corpus-v1")
```

---

**Maintained by:** [sh4lu-z](https://huggingface.co/sh4lu-z)
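The `text`/`source` record layout described under Dataset Structure lends itself to per-source inspection before training. The following is a minimal sketch, not part of the original card: the helper names `count_by_source` and `filter_by_source` are hypothetical, and the sample rows are placeholders rather than real corpus data. Only the `wikipedia` and `mc4` labels come from the card; the labels used for the other sources are not documented here.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

Record = Dict[str, str]


def count_by_source(records: Iterable[Record]) -> Counter:
    """Count how many records carry each `source` label."""
    return Counter(record["source"] for record in records)


def filter_by_source(records: Iterable[Record], source: str) -> Iterator[Record]:
    """Yield only the records whose `source` field equals `source`."""
    return (record for record in records if record["source"] == source)


# Hypothetical stand-in rows with the card's two fields; not real corpus data.
sample = [
    {"text": "placeholder text A", "source": "wikipedia"},
    {"text": "placeholder text B", "source": "mc4"},
    {"text": "placeholder text C", "source": "mc4"},
]

counts = count_by_source(sample)
wiki_rows = list(filter_by_source(sample, "wikipedia"))
print(counts)
print(wiki_rows)
```

With the real corpus, the same helpers should accept `ds["train"]` from the loading snippet, since iterating a `datasets.Dataset` yields one dict per row. Running `count_by_source` first is a reasonable way to discover which `source` labels actually occur.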

Provider:

sh4lu-z
