RaniduG/SiPaKosa-Sent
Source: Hugging Face. Updated 2026-04-05. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/RaniduG/SiPaKosa-Sent
Resource description:
---
license: mit
task_categories:
- text-generation
- text-classification
language:
- si
pretty_name: SiPaKosa
size_categories:
- 100K<n<1M
configs:
- config_name: sinhala_metadata
  data_files:
  - split: train
    path: data/sinhala/train.csv
  - split: validation
    path: data/sinhala/validation.csv
  - split: test
    path: data/sinhala/test.csv
- config_name: mixed_metadata
  data_files:
  - split: train
    path: data/mixed/train.csv
  - split: validation
    path: data/mixed/validation.csv
  - split: test
    path: data/mixed/test.csv
---
# SiPaKosa: Sinhala-Pali Buddhist Corpus
A comprehensive corpus of canonical and classical Buddhist texts in Sinhala and Pali, compiled from historical archives and web-scraped canonical scriptures.
This is the **sentence-level** version of the [SiPaKosa](https://huggingface.co/datasets/RaniduG/SiPaKosa) dataset.
Where SiPaKosa stores text at the book level, this dataset splits each book into individual sentences.
**Related dataset (book-level):** [RaniduG/SiPaKosa](https://huggingface.co/datasets/RaniduG/SiPaKosa)
## Dataset Statistics
- **Total Sentences**: 786,344
- **Sinhala Sentences**: 465,539 (59.2%)
- **Mixed Sinhala-Pali**: 320,805 (40.8%)
- **Sources**: 16 historical books (IFBC) + 5 Nikayas (Tripitaka)
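
The percentages above follow from the sentence counts. A minimal arithmetic check, using only the figures listed in this section (the helper name `share` is illustrative):

```python
# Sentence counts as reported in the statistics above.
TOTAL = 786_344
SINHALA = 465_539
MIXED = 320_805


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    # The two language subsets should add up to the total.
    print(SINHALA + MIXED == TOTAL)
    print(share(SINHALA, TOTAL), share(MIXED, TOTAL))
```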
## Dataset Configs
There are four configs available:
| Config | Format | Columns | Best for |
|---|---|---|---|
| `sinhala` | txt | text only | model training |
| `mixed` | txt | text only | model training |
| `sinhala_metadata` | csv | sentence_id, book_category, book_name_si, book_name_en, source, text, language | filtering by book or source |
| `mixed_metadata` | csv | sentence_id, book_category, book_name_si, book_name_en, source, text, language | filtering by book or source |
## Dataset Structure
```
data/
├── sinhala/
│   ├── train.txt
│   ├── train.csv
│   ├── validation.txt
│   ├── validation.csv
│   ├── test.txt
│   └── test.csv
└── mixed/
    ├── train.txt
    ├── train.csv
    ├── validation.txt
    ├── validation.csv
    ├── test.txt
    └── test.csv
```
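
The layout above pairs each config with one file per split: the plain configs read the `.txt` files and the `*_metadata` configs read the `.csv` files. A small sketch of that mapping, where `data_file` is a hypothetical helper and not part of the dataset or of the `datasets` library:

```python
SPLITS = ("train", "validation", "test")
LANGUAGES = ("sinhala", "mixed")
META_SUFFIX = "_metadata"


def data_file(config: str, split: str) -> str:
    """Map a config name and split to the relative file path in the repo."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Configs ending in "_metadata" point at CSV files; the others at TXT.
    is_meta = config.endswith(META_SUFFIX)
    language = config[: -len(META_SUFFIX)] if is_meta else config
    if language not in LANGUAGES:
        raise ValueError(f"unknown config: {config!r}")
    extension = "csv" if is_meta else "txt"
    return f"data/{language}/{split}.{extension}"


if __name__ == "__main__":
    print(data_file("sinhala_metadata", "train"))
    print(data_file("mixed", "test"))
```

This can be handy when downloading a single file rather than loading a full config.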
## CSV Columns
| Column | Description | Example |
|---|---|---|
| `sentence_id` | Globally unique sentence ID | 1 |
| `book_category` | Category of the source book | books-related-to-the-tipitaka |
| `book_name_si` | Sinhala book name (IFBC only) | විශුද්ධිමාර්ගය |
| `book_name_en` | English book name (Tripitaka only) | Digha Nikaya |
| `source` | Data source | IFBC or Tripitaka |
| `text` | The sentence | මා හට අසන්නට ලැබුණේ... |
| `language` | Language classification | sinhala or mixed |
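
As a rough illustration of this schema, the sketch below parses CSV text with the standard library and picks whichever book name is populated. It assumes the header order matches the table and that the unused book-name column is left empty. The sample rows are made up from the example values above, and the second row's category and text are placeholders rather than corpus data.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "sentence_id", "book_category", "book_name_si",
    "book_name_en", "source", "text", "language",
]

# Made-up sample rows; not taken from the actual files.
SAMPLE_CSV = (
    "sentence_id,book_category,book_name_si,book_name_en,source,text,language\n"
    "1,books-related-to-the-tipitaka,විශුද්ධිමාර්ගය,,IFBC,මා හට අසන්නට ලැබුණේ...,sinhala\n"
    "2,placeholder-category,,Digha Nikaya,Tripitaka,placeholder text,mixed\n"
)


def read_rows(handle):
    """Read metadata rows, checking the header and casting sentence_id to int."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        row["sentence_id"] = int(row["sentence_id"])
        rows.append(row)
    return rows


def book_name(row):
    """Return the Sinhala name (IFBC) or, failing that, the English name (Tripitaka)."""
    return row["book_name_si"] or row["book_name_en"]


if __name__ == "__main__":
    for row in read_rows(io.StringIO(SAMPLE_CSV)):
        print(row["sentence_id"], row["source"], book_name(row))
```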
## Metadata Structure
The `metadata/` folder contains two sub-folders: `pdf/` and `tripitaka/`.
### `metadata/pdf/`
Holds statistics and manifest data for the 16 digitised Buddhist books in the PDF corpus.
- `corpus_manifest.json` — lists each book with its name (Sinhala and English), category, and file paths.
- `corpus_statistics.json` — high-level summary: total books (16), total pages (7,064), language split (Sinhala vs. mixed), and category distribution.
- `detailed_corpus_statistics.json` — per-book and per-category breakdown including word counts, character counts, and averages per page. Covers three categories: `books-related-to-the-tipitaka`, `old-books`, and `buddhist-characters`.
### `metadata/tripitaka/`
Contains scraped sutta data from [tripitaka.online](https://tripitaka.online), organised by nikaya. Each nikaya has its own sub-folder (e.g., `digha/`, `majjhima/`, `anguttara/`).
Inside each sub-folder:
- `suttas_batch_{number}.json` — batched sutta records. Each entry contains the URL, title, Sinhala content, Pali content, word counts, nikaya info, scraping method, and timestamp.
- `error_log.json` — records any suttas that failed to scrape.
- `scraping_progress.json` — tracks how many suttas were scraped vs. errored.
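
A minimal sketch for reading the batched sutta records from one nikaya folder. It assumes each `suttas_batch_{number}.json` file holds a JSON array of records. The key names inside each record are not documented here, so the code does not rely on any of them, and `iter_suttas` is an illustrative name.

```python
import json
from pathlib import Path


def iter_suttas(nikaya_dir):
    """Yield every record from the suttas_batch_*.json files in one nikaya folder."""
    # Note: sorted() orders file names lexicographically, so batch 10 sorts before batch 2.
    for path in sorted(Path(nikaya_dir).glob("suttas_batch_*.json")):
        with path.open(encoding="utf-8") as handle:
            batch = json.load(handle)
        # Assumption: each batch file is a JSON array of sutta records.
        if not isinstance(batch, list):
            raise ValueError(f"{path.name}: expected a JSON array")
        yield from batch


if __name__ == "__main__":
    # Example path based on the folder layout described above.
    records = list(iter_suttas("metadata/tripitaka/digha"))
    print(len(records))
```

Because the glob pattern only matches batch files, `error_log.json` and `scraping_progress.json` are skipped.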
## Quick Start
```python
import pandas as pd
from datasets import load_dataset

# Repo ID of the sentence-level dataset described on this card
REPO_ID = "RaniduG/SiPaKosa-Sent"

# Load plain text for model training
sinhala_ds = load_dataset(REPO_ID, "sinhala")
print(sinhala_ds["train"][0])

mixed_ds = load_dataset(REPO_ID, "mixed")
print(mixed_ds["train"][0])

# Load with metadata for filtering by book or source
sinhala_meta = load_dataset(REPO_ID, "sinhala_metadata")
print(sinhala_meta["train"][0])

# Filter by source
df = sinhala_meta["train"].to_pandas()
ifbc_only = df[df["source"] == "IFBC"]
tripitaka_only = df[df["source"] == "Tripitaka"]

# Filter by book (Sinhala book names are populated for IFBC rows only)
book_df = df[df["book_name_si"] == "විශුද්ධිමාර්ගය"]
```
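
For quick tallies without pandas, rows can also be counted directly, since iterating over a `datasets` split yields one dict per row. The sketch below uses toy rows in place of a loaded split, and `count_by` is a hypothetical helper:

```python
from collections import Counter


def count_by(rows, column):
    """Count rows per distinct value of one column."""
    return Counter(row[column] for row in rows)


if __name__ == "__main__":
    # Toy rows standing in for a metadata split; not real corpus data.
    toy_rows = [
        {"source": "IFBC", "language": "sinhala"},
        {"source": "IFBC", "language": "mixed"},
        {"source": "Tripitaka", "language": "mixed"},
    ]
    print(count_by(toy_rows, "source"))
    print(count_by(toy_rows, "language"))
```

With a loaded metadata split, the same call would take that split (for example the `train` split) in place of `toy_rows`.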
## Documentation
- [Citation](docs/CITATION.bib) - How to cite this work
## License
This dataset is released under the MIT License for research purposes.
## Paper
**https://arxiv.org/abs/2603.29221**
Provider: RaniduG



