
RaniduG/SiPaKosa-Sent

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/RaniduG/SiPaKosa-Sent
Resource description:
---
license: mit
task_categories:
- text-generation
- text-classification
language:
- si
pretty_name: SiPaKosa
size_categories:
- 100K<n<1M
configs:
- config_name: sinhala_metadata
  data_files:
  - split: train
    path: data/sinhala/train.csv
  - split: validation
    path: data/sinhala/validation.csv
  - split: test
    path: data/sinhala/test.csv
- config_name: mixed_metadata
  data_files:
  - split: train
    path: data/mixed/train.csv
  - split: validation
    path: data/mixed/validation.csv
  - split: test
    path: data/mixed/test.csv
---

# SiPaKosa: Sinhala-Pali Buddhist Corpus

A comprehensive corpus of canonical and classical Buddhist texts in Sinhala and Pali, compiled from historical archives and web-scraped canonical scriptures.

This is the **sentence-level** version of the [SiPaKosa](https://huggingface.co/datasets/RaniduG/SiPaKosa) dataset. Where SiPaKosa contains book-level text, this dataset contains every sentence, grouped by book.

**Related dataset (book-level):** [RaniduG/SiPaKosa](https://huggingface.co/datasets/RaniduG/SiPaKosa)

## Dataset Statistics

- **Total Sentences**: 786,344
- **Sinhala Sentences**: 465,539 (59.2%)
- **Mixed Sinhala-Pali**: 320,805 (40.8%)
- **Sources**: 16 historical books (IFBC) + 5 Nikayas (Tripitaka)

## Dataset Configs

There are four configs available:

| Config | Format | Columns | Best for |
|---|---|---|---|
| `sinhala` | txt | text only | model training |
| `mixed` | txt | text only | model training |
| `sinhala_metadata` | csv | sentence_id, book_category, book_name_si, book_name_en, source, text, language | filtering by book or source |
| `mixed_metadata` | csv | sentence_id, book_category, book_name_si, book_name_en, source, text, language | filtering by book or source |

## Dataset Structure

```
data/
├── sinhala/
│   ├── train.txt
│   ├── train.csv
│   ├── validation.txt
│   ├── validation.csv
│   ├── test.txt
│   └── test.csv
└── mixed/
    ├── train.txt
    ├── train.csv
    ├── validation.txt
    ├── validation.csv
    ├── test.txt
    └── test.csv
```

## CSV Columns

| Column | Description | Example |
|---|---|---|
| `sentence_id` | Globally unique sentence ID | 1 |
| `book_category` | Category of the source book | books-related-to-the-tipitaka |
| `book_name_si` | Sinhala book name (IFBC only) | විශුද්ධිමාර්ගය |
| `book_name_en` | English book name (Tripitaka only) | Digha Nikaya |
| `source` | Data source | IFBC or Tripitaka |
| `text` | The sentence | මා හට අසන්නට ලැබුණේ... |
| `language` | Language classification | sinhala or mixed |

## Metadata Structure

The `metadata/` folder contains two sub-folders: `pdf/` and `tripitaka/`.

### `metadata/pdf/`

Holds statistics and manifest data for the 16 digitised Buddhist books in the PDF corpus.

- `corpus_manifest.json`: lists each book with its name (Sinhala and English), category, and file paths.
- `corpus_statistics.json`: high-level summary covering total books (16), total pages (7,064), language split (Sinhala vs. mixed), and category distribution.
- `detailed_corpus_statistics.json`: per-book and per-category breakdown including word counts, character counts, and averages per page. Covers three categories: `books-related-to-the-tipitaka`, `old-books`, and `buddhist-characters`.

### `metadata/tripitaka/`

Contains scraped sutta data from [tripitaka.online](https://tripitaka.online), organised by nikaya. Each nikaya has its own sub-folder (e.g., `digha/`, `majjhima/`, `anguttara/`).

Inside each sub-folder:

- `suttas_batch_{number}.json`: batched sutta records. Each entry contains the URL, title, Sinhala content, Pali content, word counts, nikaya info, scraping method, and timestamp.
- `error_log.json`: records any suttas that failed to scrape.
- `scraping_progress.json`: tracks how many suttas were scraped vs. errored.
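The batch layout described above can be walked with a few lines of standard-library Python. The following is a minimal sketch, not part of the dataset tooling: the helper name `count_sutta_records` is hypothetical, and it assumes that each `suttas_batch_{number}.json` file holds a top-level JSON list of records, which the card does not state explicitly.

```python
import json
from pathlib import Path


def count_sutta_records(tripitaka_dir):
    """Count sutta records per nikaya sub-folder.

    Assumption (not confirmed by the dataset card): every
    suttas_batch_*.json file contains a JSON list of records.
    """
    counts = {}
    for nikaya_dir in sorted(Path(tripitaka_dir).iterdir()):
        if not nikaya_dir.is_dir():
            continue
        total = 0
        # Only batch files are counted; error_log.json and
        # scraping_progress.json do not match this pattern.
        for batch_file in sorted(nikaya_dir.glob("suttas_batch_*.json")):
            with batch_file.open(encoding="utf-8") as fh:
                records = json.load(fh)
            total += len(records)
        counts[nikaya_dir.name] = total
    return counts
```

With a local copy of the repository, `count_sutta_records("metadata/tripitaka")` would return a dictionary keyed by nikaya folder name, which could then be compared against the numbers in each `scraping_progress.json`.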
## Quick Start

```python
import pandas as pd
from datasets import load_dataset

# Load plain text for model training
sinhala_ds = load_dataset("RaniduG/SiPaKosa-Sent", "sinhala")
print(sinhala_ds["train"][0])

mixed_ds = load_dataset("RaniduG/SiPaKosa-Sent", "mixed")
print(mixed_ds["train"][0])

# Load with metadata for filtering by book or source
sinhala_meta = load_dataset("RaniduG/SiPaKosa-Sent", "sinhala_metadata")
print(sinhala_meta["train"][0])

# Convert the train split to a DataFrame, then filter by source
df = sinhala_meta["train"].to_pandas()
ifbc_only = df[df["source"] == "IFBC"]
tripitaka_only = df[df["source"] == "Tripitaka"]

# Filter by book
book_df = df[df["book_name_si"] == "විශුද්ධිමාර්ගය"]
```

## Documentation

- [Citation](docs/CITATION.bib): how to cite this work

## License

This dataset is released under the MIT License for research purposes.

## Paper

**https://arxiv.org/abs/2603.29221**
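
## Example: Counting Sentences by Source

The metadata CSV columns described in this card can also be summarised without any third-party library. The sketch below is illustrative only: the function name `count_by_source_and_language` is hypothetical, the sample rows are placeholders rather than real corpus sentences, and it assumes the CSV files start with a header row that uses the column names listed under CSV Columns.

```python
import csv
import io
from collections import Counter


def count_by_source_and_language(csv_file):
    """Count rows per (source, language) pair in a metadata CSV.

    Assumption: the file has a header row with `source` and `language`.
    """
    reader = csv.DictReader(csv_file)
    return Counter((row["source"], row["language"]) for row in reader)


# Placeholder rows for illustration; these are not real corpus sentences.
sample_csv = io.StringIO(
    "sentence_id,book_category,book_name_si,book_name_en,source,text,language\n"
    "1,books-related-to-the-tipitaka,,,IFBC,placeholder sentence,sinhala\n"
    "2,old-books,,,IFBC,placeholder sentence,sinhala\n"
    "3,old-books,,,IFBC,placeholder sentence,mixed\n"
    "4,,,Digha Nikaya,Tripitaka,placeholder sentence,mixed\n"
)

counts = count_by_source_and_language(sample_csv)
for (source, language), n in sorted(counts.items()):
    print(source, language, n)
```

For a real file, the same function could be called on `open("data/sinhala/train.csv", encoding="utf-8", newline="")`, assuming a local copy of the repository.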

Provider:
RaniduG