RaniduG/SiPaKosa-Sent

Name: RaniduG/SiPaKosa-Sent
Creator: RaniduG
Published: 2026-04-05 20:48:36
License: MIT

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/RaniduG/SiPaKosa-Sent

Description:

---
license: mit
task_categories:
- text-generation
- text-classification
language:
- si
pretty_name: SiPaKosa
size_categories:
- 100K<n<1M
configs:
- config_name: sinhala_metadata
  data_files:
  - split: train
    path: data/sinhala/train.csv
  - split: validation
    path: data/sinhala/validation.csv
  - split: test
    path: data/sinhala/test.csv
- config_name: mixed_metadata
  data_files:
  - split: train
    path: data/mixed/train.csv
  - split: validation
    path: data/mixed/validation.csv
  - split: test
    path: data/mixed/test.csv
---

# SiPaKosa: Sinhala-Pali Buddhist Corpus

A comprehensive corpus of canonical and classical Buddhist texts in Sinhala and Pali, compiled from historical archives and web-scraped canonical scriptures.

This is the **sentence-level** version of the [SiPaKosa](https://huggingface.co/datasets/RaniduG/SiPaKosa) dataset. Where SiPaKosa contains book-level text, this dataset splits each book into individual sentences.

**Related dataset (book-level):** [RaniduG/SiPaKosa](https://huggingface.co/datasets/RaniduG/SiPaKosa)

## Dataset Statistics

- **Total Sentences**: 786,344
- **Sinhala Sentences**: 465,539 (59.2%)
- **Mixed Sinhala-Pali**: 320,805 (40.8%)
- **Sources**: 16 historical books (IFBC) + 5 Nikayas (Tripitaka)

## Dataset Configs

There are four configs available:

| Config | Format | Columns | Best for |
|---|---|---|---|
| `sinhala` | txt | text only | model training |
| `mixed` | txt | text only | model training |
| `sinhala_metadata` | csv | sentence_id, book_category, book_name_si, book_name_en, source, text, language | filtering by book or source |
| `mixed_metadata` | csv | sentence_id, book_category, book_name_si, book_name_en, source, text, language | filtering by book or source |

## Dataset Structure

```
data/
├── sinhala/
│   ├── train.txt
│   ├── train.csv
│   ├── validation.txt
│   ├── validation.csv
│   ├── test.txt
│   └── test.csv
└── mixed/
    ├── train.txt
    ├── train.csv
    ├── validation.txt
    ├── validation.csv
    ├── test.txt
    └── test.csv
```

## CSV Columns

| Column | Description | Example |
|---|---|---|
| `sentence_id` | Globally unique sentence ID | 1 |
| `book_category` | Category of the source book | books-related-to-the-tipitaka |
| `book_name_si` | Sinhala book name (IFBC only) | විශුද්ධිමාර්ගය |
| `book_name_en` | English book name (Tripitaka only) | Digha Nikaya |
| `source` | Data source | IFBC or Tripitaka |
| `text` | The sentence | මා හට අසන්නට ලැබුණේ... |
| `language` | Language classification | sinhala or mixed |

## Metadata Structure

The `metadata/` folder contains two sub-folders: `pdf/` and `tripitaka/`.

### `metadata/pdf/`

Holds statistics and manifest data for the 16 digitised Buddhist books in the PDF corpus.

- `corpus_manifest.json`: lists each book with its name (Sinhala and English), category, and file paths.
- `corpus_statistics.json`: high-level summary covering total books (16), total pages (7,064), language split (Sinhala vs. mixed), and category distribution.
- `detailed_corpus_statistics.json`: per-book and per-category breakdown including word counts, character counts, and averages per page. Covers three categories: `books-related-to-the-tipitaka`, `old-books`, and `buddhist-characters`.

### `metadata/tripitaka/`

Contains scraped sutta data from [tripitaka.online](https://tripitaka.online), organised by nikaya. Each nikaya has its own sub-folder (e.g., `digha/`, `majjhima/`, `anguttara/`). Inside each sub-folder:

- `suttas_batch_{number}.json`: batched sutta records. Each entry contains the URL, title, Sinhala content, Pali content, word counts, nikaya info, scraping method, and timestamp.
- `error_log.json`: records any suttas that failed to scrape.
- `scraping_progress.json`: tracks how many suttas were scraped vs. errored.

## Quick Start

```python
from datasets import load_dataset

# Load plain text for model training
sinhala_ds = load_dataset("RaniduG/SiPaKosa-Sent", "sinhala")
print(sinhala_ds["train"][0])

mixed_ds = load_dataset("RaniduG/SiPaKosa-Sent", "mixed")
print(mixed_ds["train"][0])

# Load with metadata for filtering by book or source
sinhala_meta = load_dataset("RaniduG/SiPaKosa-Sent", "sinhala_metadata")
print(sinhala_meta["train"][0])

# Convert the train split to a pandas DataFrame for filtering
df = sinhala_meta["train"].to_pandas()

# Filter by source
ifbc_only = df[df["source"] == "IFBC"]
tripitaka_only = df[df["source"] == "Tripitaka"]

# Filter by book
book_df = df[df["book_name_si"] == "විශුද්ධිමාර්ගය"]
```

## Documentation

- [Citation](docs/CITATION.bib): how to cite this work

## License

This dataset is released under the MIT License for research purposes.

## Paper

**https://arxiv.org/abs/2603.29221**
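
## Example: Summarising the Metadata CSV

The metadata configs can also be inspected without the `datasets` library. The following is a minimal sketch, assuming only the column names listed under "CSV Columns". The helpers `summarise_rows` and `percentage` are hypothetical names introduced here, and the sample rows are made up for illustration; they are not taken from the corpus.

```python
import csv
import io
from collections import Counter


def summarise_rows(rows):
    """Count sentences per source and per language for an iterable of row dicts."""
    by_source = Counter()
    by_language = Counter()
    for row in rows:
        by_source[row["source"]] += 1
        by_language[row["language"]] += 1
    return by_source, by_language


def percentage(part, total):
    """Share of `part` in `total`, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Hypothetical sample rows (illustrative only, not real corpus data).
SAMPLE_CSV = """sentence_id,book_category,book_name_si,book_name_en,source,text,language
1,books-related-to-the-tipitaka,example-book,,IFBC,example sentence one,sinhala
2,books-related-to-the-tipitaka,example-book,,IFBC,example sentence two,mixed
3,example-category,,Digha Nikaya,Tripitaka,example sentence three,sinhala
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_source, by_language = summarise_rows(rows)
print(dict(by_source))
print(dict(by_language))

# The same arithmetic reproduces the split reported under "Dataset Statistics".
print(percentage(465_539, 786_344), percentage(320_805, 786_344))
```

To run the summary on a real split, replace the in-memory sample with a downloaded file, for example `open("data/sinhala/train.csv", encoding="utf-8", newline="")`, and pass the resulting `csv.DictReader` to `summarise_rows`.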
