
NaolBM/amharic_bible_corpus

Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/NaolBM/amharic_bible_corpus
Resource description:
---
dataset_info:
  features:
  - name: verse_id
    dtype: string
  - name: book
    dtype: string
  - name: chapter
    dtype: string
  - name: verse
    dtype: int64
  - name: verse_text
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5488640
    num_examples: 30752
  download_size: 2000000
  dataset_size: 5488640
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- am
license: apache-2.0
task_categories:
- text-generation
- fill-mask
- text-classification
size_categories:
- 10K<n<100K
pretty_name: +
---

# Amharic Bible Corpus Dataset

This is an open Protestant Amharic Bible corpus dataset for LLM pretraining and NLP research.

## Dataset Description

This dataset contains the complete Amharic Bible text, formatted for language model pretraining. Each entry is a single Bible verse with its reference in the format: `Book Chapter:Verse Verse text`.

### Features

- `text`: Complete verse text with book, chapter, and verse reference
  - Format: `Book Chapter:Verse Verse text`

### Dataset Structure

The dataset contains a single split:

- **train**: 30,752 Bible verses for training language models

### Languages

- Primary language: Amharic (am)
- Script: Ge'ez/Ethiopic script
- Language code: am-ET

### Statistics

| Metric | Count |
|--------|-------|
| Total Books | 66 |
| Total Chapters | 1,189 |
| Total Verses | 30,752 |
| File Size | 5.23 MB |
| Format | Parquet |

### Sample Verses

1. **ኦሪት ዘፍጥረት 1:1** - በመጀመሪያ እግዚአብሔር ሰማይንና ምድርን ፈጠረ።
2. **ኦሪት ዘፍጥረት 1:2** - ምድርም ባዶ ነበረች፥ አንዳችም አልነበረባትም፤ ጨለማም በጥልቁ ላይ ነበረ፤ የእግዚአብሔርም መንፈስ በውኃ ላይ ሰፍፎ ነበር።
3. **ኦሪት ዘፍጥረት 1:3** - እግዚአብሔርም። ብርሃን ይሁን ኣለ፤ ብርሃንም ሆነ።
4. **ኦሪት ዘፍጥረት 1:4** - እግዚአብሔርም ብርሃኑ መልካም እንደ ሆነ አየ፤ እግዚብሔርም ብርሃንንና ጨለማን ለየ።
5. **ኦሪት ዘፍጥረት 1:5** - እግዚአብሔርም ብርሃኑን ቀን ብሎ ጠራው፥ ጨለማውንም ሌሊት አለው። ማታም ሆነ ጥዋትም ሆነ፥ አንድ ቀን።

### Book Summary (First 10 Books)

| Book | Chapters | Verses |
|------|----------|--------|
| ኦሪት ዘፍጥረት | 50 | 1,530 |
| ኦሪት ዘጸአት | 40 | 1,207 |
| ኦሪት ዘሌዋውያን | 27 | 855 |
| ኦሪት ዘኍልቍ | 36 | 1,276 |
| ኦሪት ዘዳግም | 34 | 934 |
| መጽሐፈ ኢያሱ ወልደ ነዌ | 24 | 636 |
| መጽሐፈ መሣፍንት | 21 | 616 |
| መጽሐፈ ሩት | 4 | 85 |
| መጽሐፈ ሳሙኤል ቀዳማዊ | 31 | 809 |
| መጽሐፈ ሳሙኤል ካል | 24 | 691 |

*And 56 more books...*

### Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("NaolBM/amharic_bible_corpus")

# Access the training data
train_data = dataset["train"]

# Iterate through verses
for example in train_data:
    print(example["text"])
    # Example: "ኦሪት ዘፍጥረት 1:1 በመጀመሪያ እግዚአብሔር ሰማይንና ምድርን ፈጠረ።"

# Get statistics
print(f"Total verses: {len(train_data)}")
```
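The `Book Chapter:Verse Verse text` layout of the `text` field can be split back into its parts when only that field is at hand (the metadata also lists separate `book`, `chapter`, `verse`, and `verse_text` columns, which are preferable when available). The following is a minimal sketch, not part of the dataset's own tooling: the names `REFERENCE_PATTERN`, `parse_verse`, and `verses_per_book` are hypothetical, and it assumes the first `digits:digits` token in each string is the chapter and verse reference.

```python
import re
from collections import Counter

# Assumed layout of `text`: "<book> <chapter>:<verse> <verse text>".
# Book names may contain spaces, so the book group is matched lazily
# up to the first "<digits>:<digits>" token.
REFERENCE_PATTERN = re.compile(
    r"^(?P<book>.+?)\s+(?P<chapter>\d+):(?P<verse>\d+)\s+(?P<verse_text>.+)$",
    re.DOTALL,
)


def parse_verse(text):
    """Split one `text` value into book, chapter, verse, and verse_text."""
    match = REFERENCE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized verse format: {text!r}")
    parts = match.groupdict()
    # Mirror the dtypes in the dataset metadata: chapter stays a string,
    # verse becomes an integer.
    parts["verse"] = int(parts["verse"])
    return parts


def verses_per_book(texts):
    """Count how many of the given `text` values belong to each book."""
    return Counter(parse_verse(t)["book"] for t in texts)


# Sample string taken from the Sample Verses list above.
sample = "ኦሪት ዘፍጥረት 1:1 በመጀመሪያ እግዚአብሔር ሰማይንና ምድርን ፈጠረ።"
parsed = parse_verse(sample)
print(parsed["book"], parsed["chapter"], parsed["verse"])
print(verses_per_book([sample]))
```

Applied to `train_data["text"]`, `verses_per_book` could be used to cross-check the per-book verse counts in the Book Summary table, provided the format assumption holds for every row.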
Provider:
NaolBM