NaolBM/amharic_bible_corpus
Source: Hugging Face · Updated: 2025-12-10 · Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/NaolBM/amharic_bible_corpus
Resource description:
---
dataset_info:
  features:
  - name: verse_id
    dtype: string
  - name: book
    dtype: string
  - name: chapter
    dtype: string
  - name: verse
    dtype: int64
  - name: verse_text
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5488640
    num_examples: 30752
  download_size: 2000000
  dataset_size: 5488640
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- am
license: apache-2.0
task_categories:
- text-generation
- fill-mask
- text-classification
size_categories:
- 10K<n<100K
pretty_name: +
---
# Amharic Bible Corpus Dataset
An open corpus of the Protestant Amharic Bible, intended for LLM pretraining and NLP research.
## Dataset Description
This dataset contains the complete Amharic Bible text, formatted for language model pretraining. Each entry is a single Bible verse with its reference in the format: `Book Chapter:Verse Verse text`.
### Features
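- `text`: Complete verse text with book, chapter, and verse reference
  - Format: `Book Chapter:Verse Verse text`
- `verse_id`, `book`, `chapter`, `verse`, `verse_text`: additional columns declared in the dataset metadata (`verse` is `int64`; the others are `string`)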
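
Because book names can themselves contain spaces, splitting `text` on the first space is not enough to recover the reference. The following is a minimal sketch, assuming every row follows the `Book Chapter:Verse Verse text` layout with ASCII digits in the reference. `parse_text` and `ParsedVerse` are hypothetical helper names, not part of the dataset or of any library.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout of the `text` column: "<book> <chapter>:<verse> <verse text>".
# The book group is matched lazily, so it stops at the first
# "<digits>:<digits>" token even when the book name contains spaces.
_VERSE_RE = re.compile(
    r"^(?P<book>.+?)\s+(?P<chapter>[0-9]+):(?P<verse>[0-9]+)\s+(?P<verse_text>.+)$",
    re.DOTALL,
)


class ParsedVerse(NamedTuple):
    book: str
    chapter: str  # kept as a string, matching the dtype declared in the metadata
    verse: int
    verse_text: str


def parse_text(text: str) -> Optional[ParsedVerse]:
    """Split one `text` value into its parts, or return None if it does not fit."""
    match = _VERSE_RE.match(text.strip())
    if match is None:
        return None
    return ParsedVerse(
        book=match["book"],
        chapter=match["chapter"],
        verse=int(match["verse"]),
        verse_text=match["verse_text"],
    )


parsed = parse_text("ኦሪት ዘፍጥረት 1:1 በመጀመሪያ እግዚአብሔር ሰማይንና ምድርን ፈጠረ።")
print(parsed)
```

Returning `None` for rows that do not match, rather than raising, makes it easy to count how many rows deviate from the assumed layout.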
### Dataset Structure
The dataset contains a single split:
- **train**: 30,752 Bible verses for training language models
### Languages
- Primary language: Amharic (am)
- Script: Ge'ez/Ethiopic script
- Language code: am-ET
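
One way to sanity-check the script of a string is to measure how much of it falls inside the Unicode Ethiopic blocks. This is an illustrative sketch only: `is_ethiopic` and `ethiopic_ratio` are hypothetical helpers, and the ratio says nothing about whether the text is Amharic as opposed to another language written in the same script.

```python
# Unicode blocks that hold Ethiopic (Ge'ez) characters.
ETHIOPIC_RANGES = (
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
)


def is_ethiopic(ch: str) -> bool:
    """True if the single character `ch` lies in one of the Ethiopic blocks."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ETHIOPIC_RANGES)


def ethiopic_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Ethiopic (0.0 for empty input)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_ethiopic(ch) for ch in chars) / len(chars)


print(ethiopic_ratio("በመጀመሪያ እግዚአብሔር ሰማይንና ምድርን ፈጠረ።"))
```

Note that the `text` column also carries ASCII digits and a colon from the reference, so its ratio would be expected to sit slightly below 1.0 even for clean rows.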
### Statistics
| Metric | Count |
|--------|-------|
| Total Books | 66 |
| Total Chapters | 1,189 |
| Total Verses | 30,752 |
| File Size | 5.23 MB |
| Format | Parquet |
### Sample Verses
1. **ኦሪት ዘፍጥረት 1:1** - በመጀመሪያ እግዚአብሔር ሰማይንና ምድርን ፈጠረ።
2. **ኦሪት ዘፍጥረት 1:2** - ምድርም ባዶ ነበረች፥ አንዳችም አልነበረባትም፤ ጨለማም በጥልቁ ላይ ነበረ፤ የእግዚአብሔርም መንፈስ በውኃ ላይ ሰፍፎ ነበር።
3. **ኦሪት ዘፍጥረት 1:3** - እግዚአብሔርም። ብርሃን ይሁን ኣለ፤ ብርሃንም ሆነ።
4. **ኦሪት ዘፍጥረት 1:4** - እግዚአብሔርም ብርሃኑ መልካም እንደ ሆነ አየ፤ እግዚብሔርም ብርሃንንና ጨለማን ለየ።
5. **ኦሪት ዘፍጥረት 1:5** - እግዚአብሔርም ብርሃኑን ቀን ብሎ ጠራው፥ ጨለማውንም ሌሊት አለው። ማታም ሆነ ጥዋትም ሆነ፥ አንድ ቀን።
### Book Summary (First 10 Books)
| Book | Chapters | Verses |
|------|----------|--------|
| ኦሪት ዘፍጥረት | 50 | 1,530 |
| ኦሪት ዘጸአት | 40 | 1,207 |
| ኦሪት ዘሌዋውያን | 27 | 855 |
| ኦሪት ዘኍልቍ | 36 | 1,276 |
| ኦሪት ዘዳግም | 34 | 934 |
| መጽሐፈ ኢያሱ ወልደ ነዌ | 24 | 636 |
| መጽሐፈ መሣፍንት | 21 | 616 |
| መጽሐፈ ሩት | 4 | 85 |
| መጽሐፈ ሳሙኤል ቀዳማዊ | 31 | 809 |
| መጽሐፈ ሳሙኤል ካል | 24 | 691 |
*And 56 more books...*
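
A per-book table like the one above could be recomputed from the rows themselves. The sketch below assumes one row per verse and the `book` and `chapter` columns declared in the metadata. `summarize_books` is a hypothetical helper, and the rows in the example are toy data rather than values from the dataset. It accepts any iterable of dict-like rows, so the loaded `train` split should work as input, though that has not been verified here.

```python
def summarize_books(rows):
    """Return {book: (chapter_count, verse_count)} in first-seen book order."""
    chapters = {}  # book -> set of distinct chapter labels
    verses = {}    # book -> number of rows (one row per verse is assumed)
    for row in rows:
        book = row["book"]
        chapters.setdefault(book, set()).add(row["chapter"])
        verses[book] = verses.get(book, 0) + 1
    return {book: (len(chapters[book]), verses[book]) for book in chapters}


# Toy rows for illustration only.
toy_rows = [
    {"book": "A", "chapter": "1", "verse": 1},
    {"book": "A", "chapter": "1", "verse": 2},
    {"book": "A", "chapter": "2", "verse": 1},
    {"book": "B", "chapter": "1", "verse": 1},
]

summary = summarize_books(toy_rows)
for book, (n_chapters, n_verses) in summary.items():
    print(f"| {book} | {n_chapters} | {n_verses:,} |")

# Overall totals, comparable to the Statistics table.
print("books:", len(summary))
print("chapters:", sum(c for c, _ in summary.values()))
print("verses:", sum(v for _, v in summary.values()))
```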
### Usage
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("NaolBM/amharic_bible_corpus")

# Access the training data
train_data = dataset["train"]

# Iterate through verses
for example in train_data:
    print(example["text"])
    # Example: "ኦሪት ዘፍጥረት 1:1 በመጀመሪያ እግዚአብሔር ሰማይንና ምድርን ፈጠረ።"

# Get statistics
print(f"Total verses: {len(train_data)}")
```
Provider:
NaolBM



