addisai/wikipedia-amharic
Hugging Face · Updated 2025-12-22 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/addisai/wikipedia-amharic
---
language:
- am
- en
license: apache-2.0
task_categories:
- translation
- text-generation
- question-answering
- summarization
tags:
- wikipedia
- amharic
- ethiopian-languages
- addis-ai
- multilingual
- knowledge-base
- encyclopedic
pretty_name: Wikipedia Amharic (አማርኛ ዊኪፒዲያ)
size_categories:
- 10K<n<100K
---
# Wikipedia Amharic (አማርኛ ዊኪፒዲያ)
<div align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/80/Wikipedia-logo-v2.svg/200px-Wikipedia-logo-v2.svg.png" alt="Wikipedia Logo" width="100"/>
### High-Quality Amharic Wikipedia Translations
**Translated by Addis AI - Aleph (፩)**
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Dataset on Hugging Face](https://huggingface.co/datasets/addisai/wikipedia-amharic)
</div>
---
## Dataset Description
This dataset contains **Wikipedia articles machine-translated from English to Amharic (አማርኛ)** using **Addis AI - Aleph (፩)**, a neural translation model optimized for Ethiopian languages. Each record pairs the original English article with its Amharic translation, and the collection is intended as a large, high-quality Amharic knowledge base with encyclopedic content across diverse topics.
### Key Features
- 🌍 **Comprehensive Coverage**: Thousands of Wikipedia articles spanning diverse topics
- 🎯 **High-Quality Translation**: Professional-grade neural translation using Addis AI technology
- 📚 **Parallel Corpus**: Both original English and Amharic translations provided
- 🔗 **Fully Linked**: Complete with original Wikipedia URLs and IDs for reference
- ✅ **Quality Assured**: Multiple validation steps ensure translation accuracy
- 🇪🇹 **Cultural Sensitivity**: Translations maintain Ethiopian context and cultural appropriateness
### Dataset Summary
| Feature | Details |
|---------|---------|
| **Source** | Wikimedia Wikipedia (20231101.en) |
| **Target Language** | Amharic (አማርኛ) |
| **Translation Model** | Addis AI - Aleph (፩) |
| **Provider** | [Addis AI](https://platform.addisassistant.com) |
| **License** | Apache 2.0 |
| **Format** | JSONL with parallel text |
| **Created** | 2025 |
### Use Cases
This dataset is ideal for:
- 🤖 **Language Model Training**: Train or fine-tune Amharic language models
- 🔄 **Translation Systems**: Build English-Amharic translation models
- ❓ **Question Answering**: Create Amharic QA systems with factual knowledge
- 📖 **Knowledge Base Construction**: Build Amharic encyclopedic resources
- 🔬 **Cross-lingual Research**: Study multilingual understanding and transfer learning
- 📚 **Educational Applications**: Develop learning tools for Amharic speakers
- 🌐 **Information Retrieval**: Build Amharic search and retrieval systems
---
## Dataset Structure
### Data Fields
Each article in the dataset contains the following fields:
| Field Name | Type | Description |
|------------|------|-------------|
| `id` | string | Unique Wikipedia article identifier |
| `url` | string | Original Wikipedia article URL |
| `title_original` | string | Original English article title |
| `title_amharic` | string | Translated Amharic article title |
| `text_original` | string | Full original English article content |
| `text_amharic` | string | Full translated Amharic article content |
| `translation_metadata` | dict | Comprehensive translation metadata (see below) |
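
The field table above can be turned into a lightweight schema check. The following is a minimal sketch, not part of the dataset tooling: the helper name `find_schema_problems` is hypothetical, and it only checks presence and Python type of the seven documented fields.

```python
from typing import Any

# Field names and types taken from the Data Fields table.
EXPECTED_FIELDS = {
    "id": str,
    "url": str,
    "title_original": str,
    "title_amharic": str,
    "text_original": str,
    "text_amharic": str,
    "translation_metadata": dict,
}


def find_schema_problems(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems
```

Running it over a few records after loading is a cheap way to catch a changed column name before a long training job starts.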
### Translation Metadata Structure
```json
{
  "translated_at": 1758614227.850183,
  "model_used": "Addis AI - Aleph (፩)",
  "provider": "AddisAI",
  "source_lang": "English",
  "target_lang": "Amharic",
  "translated_fields": ["title", "text"],
  "dataset_version": "1.0"
}
```
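
The `translated_at` value looks like a Unix timestamp in seconds; the card does not say so explicitly, so treat that as an assumption. Under that assumption, a small sketch for turning it into a timezone-aware datetime (the helper name `translated_at_utc` is illustrative):

```python
from datetime import datetime, timezone


def translated_at_utc(metadata: dict) -> datetime:
    """Convert `translated_at` (assumed epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(metadata["translated_at"], tz=timezone.utc)


example_metadata = {"translated_at": 1758614227.850183}
print(translated_at_utc(example_metadata).date().isoformat())
```

Read this way, the example value falls in September 2025, which is consistent with the stated creation year.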
### Data Example
```json
{
  "id": "590",
  "url": "https://en.wikipedia.org/wiki/Austin%20%28disambiguation%29",
  "title_original": "Austin (disambiguation)",
  "title_amharic": "አውስተን (የማያሻማ)",
  "text_original": "Austin is the capital of Texas in the United States.\n\nAustin may also refer to:\n\nGeographical locations...",
  "text_amharic": "ኦስቲን የዩናይትድ ስቴትስ የቴክሳስ ግዛት ዋና ከተማ ናት።\n\nኦስቲን ደግሞ የሚከተሉትን ሊያመለክት ይችላል፦\n\nጂኦግራፊያዊ ቦታዎች...",
  "translation_metadata": {
    "translated_at": 1758614227.850183,
    "model_used": "Addis AI - Aleph (፩)",
    "provider": "AddisAI",
    "source_lang": "English",
    "target_lang": "Amharic",
    "translated_fields": ["title", "text"],
    "dataset_version": "1.0"
  }
}
```
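
The `url` field in the example is percent-encoded (`%20` for a space, `%28`/`%29` for parentheses). If you need a readable title from the URL, for instance to cross-check `title_original`, something like the following may help. It is a sketch using only the standard library; `title_from_url` is a hypothetical helper, and titles that themselves contain an unencoded `/` would need extra handling.

```python
from urllib.parse import unquote, urlsplit


def title_from_url(url: str) -> str:
    """Recover the page title from the last path segment of a Wikipedia URL."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    # Some Wikipedia URLs use underscores for spaces; the example record uses %20.
    return unquote(last_segment).replace("_", " ")


print(title_from_url("https://en.wikipedia.org/wiki/Austin%20%28disambiguation%29"))
```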
### Data Splits
Currently, the dataset is provided as a single training split. Users can create their own validation/test splits as needed:
```python
from datasets import load_dataset
dataset = load_dataset("addisai/wikipedia-amharic")
# Create your own splits
dataset = dataset["train"].train_test_split(test_size=0.1)
train_data = dataset["train"]
test_data = dataset["test"]
```
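
If you need the same articles to land in the same split on every run (or across machines), one option is to derive the split from the article `id` rather than from a random shuffle. The sketch below is one possible approach, not something the dataset prescribes; `assign_split` and its 10,000-bucket scheme are illustrative choices.

```python
import hashlib


def assign_split(article_id: str, test_fraction: float = 0.1) -> str:
    """Map an article id to "train" or "test" deterministically."""
    digest = hashlib.sha256(article_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to one of 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"
```

With the `datasets` library this could be applied as `dataset["train"].filter(lambda r: assign_split(r["id"]) == "test")`. For many workflows, passing a fixed `seed` to `train_test_split` is simpler and sufficient.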
---
## Dataset Creation
### Source Data
The source data originates from the **[Wikimedia Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia)**, specifically the **November 1, 2023 English snapshot (20231101.en)**. This represents a comprehensive snapshot of English Wikipedia at that point in time.
### Translation Process
This dataset was created using a sophisticated multi-stage translation pipeline designed specifically for handling encyclopedic content:
#### 1. **Translation Model: Addis AI - Aleph (፩)**
The translation was performed using **Addis AI - Aleph (፩)**, a state-of-the-art neural translation model with the following characteristics:
- **Ethiopian Language Specialization**: Specifically optimized for Amharic and other Ethiopian languages
- **Cultural Awareness**: Trained to understand Ethiopian context, idioms, and cultural references
- **Technical Precision**: Handles technical terminology, proper nouns, and specialized vocabulary accurately
- **Ge'ez Script Mastery**: Native support for Amharic script (Ge'ez/Fidel) with proper character handling
- **Context-Aware**: Maintains document-level context for coherent long-form translation
- **Register Sensitivity**: Appropriately handles formal, informal, and technical registers
#### 2. **Translation Pipeline**
The translation process involved:
1. **Document Preprocessing**
- Structure preservation (headings, lists, formatting)
- URL and reference handling
- Special character normalization
2. **Context-Aware Translation**
- Maintains article structure and flow
- Preserves Wikipedia formatting conventions
- Handles infoboxes and tables appropriately
- Context-aware chunking for very long articles
3. **Quality Assurance**
- Automated validation checks
- Length consistency verification
- Format preservation validation
- Metadata completeness checks
- Error detection and recovery
4. **Post-Processing**
- Final formatting verification
- Metadata enrichment
- Quality scoring
- Completeness validation
#### 3. **Technical Specifications**
- **Model Temperature**: 0.3 (for consistent, factual output)
- **Max Tokens**: 6000 per translation unit
- **Chunking Strategy**: Sentence-aware splitting for long articles
- **Parallel Processing**: Efficient batch processing with error recovery
- **API Infrastructure**: Google Generative AI Platform
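
The card does not publish the chunking code, so the following is only a rough illustration of what "sentence-aware splitting" can look like. It assumes a character budget as a stand-in for the 6000-token limit and a simple punctuation-based sentence boundary; `chunk_by_sentence` and `max_chars` are hypothetical names, and whitespace between sentences (including paragraph breaks) is collapsed to a single space for simplicity.

```python
import re

# Split on whitespace that follows a sentence-ending punctuation mark,
# keeping the punctuation attached to the preceding sentence.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text: str, max_chars: int = 2000) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_BOUNDARY.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A production pipeline would more likely count tokens with the model's tokenizer and preserve paragraph structure; this sketch only shows the idea of never cutting in the middle of a sentence.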
### Quality Control
Multiple quality control measures ensure high translation quality:
✅ **Automated Validation**
- Empty translation detection
- Length ratio verification (prevents truncation)
- Character encoding validation
- JSON structure integrity checks
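
As an illustration of the kind of automated checks listed above, here is a minimal sketch. It is not the pipeline's actual validator: the function name, the check names, and the ratio thresholds (`min_ratio`, `max_ratio`) are assumptions chosen for the example.

```python
def validate_translation(source: str, translation: str,
                         min_ratio: float = 0.3, max_ratio: float = 3.0) -> list[str]:
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    if not translation.strip():
        return ["empty_translation"]
    if source:
        # A very short translation relative to the source suggests truncation.
        ratio = len(translation) / len(source)
        if not (min_ratio <= ratio <= max_ratio):
            failures.append("length_ratio")
    # U+FFFD is the replacement character produced by failed decoding.
    if "\ufffd" in translation:
        failures.append("encoding")
    # The Ethiopic block is U+1200..U+137F; Amharic output should contain some of it.
    if not any("\u1200" <= ch <= "\u137f" for ch in translation):
        failures.append("no_ethiopic_characters")
    return failures
```

The same function can be run over downloaded records to flag rows worth a manual look, for example `validate_translation(row["text_original"], row["text_amharic"])`.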
✅ **Error Recovery**
- Automatic retry with exponential backoff
- Failed translation logging and recovery
- Checkpoint-based progress tracking
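
Retry with exponential backoff is a standard pattern for calling a rate-limited translation API. The sketch below shows the general technique under assumed parameters (five attempts, delays doubling from one second, 10% jitter); it is not the code used to build this dataset, and `retry_with_backoff` is a hypothetical helper.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 1.0, max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay,
            # plus up to 10% random jitter to avoid synchronized retries.
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.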
✅ **Metadata Tracking**
- Complete translation provenance
- Timestamp recording
- Model version tracking
- Quality indicators
---
## Usage
### Loading the Dataset
#### Basic Loading
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("addisai/wikipedia-amharic")
# Access articles
articles = dataset["train"]
print(f"Total articles: {len(articles):,}")
# Get a specific article
article = articles[0]
print(f"English Title: {article['title_original']}")
print(f"Amharic Title: {article['title_amharic']}")
print(f"Amharic Content: {article['text_amharic'][:200]}...")
```
#### Streaming for Large-Scale Processing
For memory-efficient processing of large datasets:
```python
from datasets import load_dataset
# Use streaming to avoid loading entire dataset into memory
dataset = load_dataset("addisai/wikipedia-amharic", streaming=True)
# Process articles one by one
for article in dataset["train"]:
    amharic_title = article["title_amharic"]
    amharic_text = article["text_amharic"]
    # Your processing here
    print(f"Processing: {amharic_title}")
```
### Common Use Cases
#### 1. Language Model Pre-training/Fine-tuning
```python
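from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
# Load dataset
dataset = load_dataset("addisai/wikipedia-amharic")
# Load or train Amharic tokenizer (placeholder name)
tokenizer = AutoTokenizer.from_pretrained("your-amharic-tokenizer")
if tokenizer.pad_token is None:
    # Many causal LM tokenizers have no pad token; reuse EOS for padding
    tokenizer.pad_token = tokenizer.eos_token
# Tokenize the Amharic text
def tokenize_function(examples):
    return tokenizer(examples["text_amharic"], truncation=True, max_length=512)
tokenized_dataset = dataset.map(
    tokenize_function, batched=True, remove_columns=dataset["train"].column_names
)
# Train your model (placeholder name). The collator pads each batch and
# builds the labels that causal language modeling requires (mlm=False).
model = AutoModelForCausalLM.from_pretrained("your-base-model")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./amharic-lm"),
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()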
```
#### 2. Translation Model Training
```python
# Extract parallel sentences for translation model training
from datasets import load_dataset
dataset = load_dataset("addisai/wikipedia-amharic")
# Create parallel corpus
parallel_corpus = []
for article in dataset["train"]:
    parallel_corpus.append({
        "en": article["text_original"],
        "am": article["text_amharic"]
    })
# Use for training translation models (e.g., MarianMT, mBART, etc.)
```
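
Whole articles are often too long for translation model training. The data example suggests paragraph breaks (`\n\n`) are kept in both languages, so one cautious option is to pair paragraphs by position and skip articles where the counts differ. This is a sketch under that assumption; `align_paragraphs` is a hypothetical helper, and positional pairing is not a substitute for a real sentence aligner.

```python
def align_paragraphs(text_original: str, text_amharic: str) -> list[dict[str, str]]:
    """Pair paragraphs by position; return [] when the counts disagree."""
    en_paragraphs = [p.strip() for p in text_original.split("\n\n") if p.strip()]
    am_paragraphs = [p.strip() for p in text_amharic.split("\n\n") if p.strip()]
    if len(en_paragraphs) != len(am_paragraphs):
        return []  # structure diverged, so positional pairing is unsafe
    return [{"en": en, "am": am} for en, am in zip(en_paragraphs, am_paragraphs)]
```

Applied per article, for example `align_paragraphs(article["text_original"], article["text_amharic"])`, this yields shorter training pairs while dropping articles whose structure changed during translation.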
#### 3. Question Answering Dataset Creation
```python
# Use Wikipedia articles as context for QA
from datasets import load_dataset
dataset = load_dataset("addisai/wikipedia-amharic")
for article in dataset["train"]:
    context = article["text_amharic"]
    title = article["title_amharic"]
    # Generate questions about the content
    # Create QA pairs using the Amharic context
    # Train extractive QA models
```
#### 4. Information Retrieval and Search
```python
# Build an Amharic search index
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
dataset = load_dataset("addisai/wikipedia-amharic")
model = SentenceTransformer('your-multilingual-model')
# Create embeddings for Amharic articles
embeddings = []
for article in dataset["train"]:
    embedding = model.encode(article["text_amharic"])
    embeddings.append({
        "title": article["title_amharic"],
        "embedding": embedding,
        "url": article["url"]
    })
# Use for semantic search over Amharic content
```
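
Once embeddings exist, semantic search reduces to ranking entries by similarity to a query embedding. The following dependency-free sketch ranks by cosine similarity over a list shaped like the `embeddings` list above; `cosine_similarity` and `search` are illustrative helpers, and a real index of this size would more likely use a vector library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_embedding: Sequence[float], index: list[dict], top_k: int = 3) -> list[tuple[float, dict]]:
    """Return the `top_k` (score, entry) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_embedding, entry["embedding"]), entry)
              for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The query embedding would come from the same model, for example `model.encode("your Amharic query")`, so that query and articles share one vector space.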
#### 5. Cross-lingual Research
```python
# Study translation patterns and linguistic phenomena
from datasets import load_dataset
import statistics
dataset = load_dataset("addisai/wikipedia-amharic")
# Analyze translation characteristics
length_ratios = []
for article in dataset["train"]:
    en_length = len(article["text_original"])
    am_length = len(article["text_amharic"])
    if en_length == 0:
        continue  # skip empty source texts to avoid dividing by zero
    length_ratios.append(am_length / en_length)
print(f"Average length ratio (AM/EN): {statistics.mean(length_ratios):.2f}")
print(f"Median length ratio: {statistics.median(length_ratios):.2f}")
```
#### 6. Building Amharic Knowledge Graphs
```python
# Extract entities and relationships from Amharic text
from datasets import load_dataset
dataset = load_dataset("addisai/wikipedia-amharic")
knowledge_base = []
for article in dataset["train"]:
    # Extract named entities from Amharic text
    # Build entity relationships
    # Create knowledge graph nodes and edges
    knowledge_base.append({
        "title": article["title_amharic"],
        "url": article["url"],
        "text": article["text_amharic"]
    })
```
---
## Dataset Statistics
### Content Analysis
- **Topics Covered**: Diverse range including history, science, geography, culture, technology, arts, and more
- **Article Types**: Regular articles, disambiguation pages, lists, and category pages
- **Language Coverage**: Complete parallel corpus with both English and Amharic
- **Average Article Length**: Varies from short disambiguation pages to comprehensive articles (see metadata for specifics)
### Quality Metrics
- ✅ **Translation Completeness**: All articles fully translated (no truncation)
- ✅ **Format Preservation**: Wikipedia structure maintained (headings, lists, links)
- ✅ **Metadata Integrity**: Complete provenance tracking for all translations
- ✅ **Character Encoding**: Proper Ge'ez script encoding validated
- ✅ **Error Rate**: Comprehensive error logging and recovery applied
---
## Limitations and Considerations
### Known Limitations
1. **Translation Artifacts**
- As with any machine translation, some nuances may be lost or altered
- Idioms and culturally-specific phrases may not translate perfectly
- Some translations may sound more formal than natural spoken Amharic
2. **Technical Terminology**
- Highly specialized technical terms may use transliteration
- Some scientific names kept in original form (common practice)
- New or emerging concepts may lack established Amharic equivalents
3. **Cultural Context**
- Source content from English Wikipedia has Western-centric bias
- Cultural references may not resonate equally with Ethiopian audience
- Geographic and historical content may emphasize non-Ethiopian topics
4. **Temporal Coverage**
- Content reflects Wikipedia as of November 2023
- Recent events or updates after this date not included
- Some rapidly-changing information may be outdated
5. **Regional Variations**
- Uses standard Amharic orthography
- Regional dialects and variations not extensively covered
- Formal register may differ from colloquial usage
### Source Bias Considerations
**Wikipedia Content Bias**:
- English Wikipedia has known systemic biases (Western-centric, male-dominated, etc.)
- Geographic coverage skewed toward Western countries
- Historical narratives may reflect Eurocentric perspectives
- These biases are inherited by the translation
**Translation Model Bias**:
- Model trained to maintain factual accuracy and cultural sensitivity
- Attempts to preserve neutral, encyclopedic tone
- May not fully adapt all cultural contexts to Ethiopian norms
- Users should apply critical thinking when using content
### Ethical Considerations
**Recommended Uses** ✅:
- Educational and research purposes
- Training NLP models for Amharic
- Building knowledge bases and reference materials
- Developing language technology tools
- Cross-lingual information access
**Use with Caution** ⚠️:
- Factual claims in critical domains (verify with experts)
- Medical or legal information (requires professional validation)
- Cultural or religious content (consider cultural review)
- Historical narratives (be aware of perspective biases)
**Not Recommended** ❌:
- Sole authoritative source for critical decisions
- Medical diagnosis or treatment decisions
- Legal advice or proceedings
- As-is content for publication without review
---
## License and Attribution
### License
This dataset is released under the **Apache License 2.0**.
**Key terms**:
- ✅ Commercial use allowed
- ✅ Modification allowed
- ✅ Distribution allowed
- ✅ Patent use allowed
- ⚠️ Must provide attribution
- ⚠️ Must include license copy
- ⚠️ Must state changes made
Full license: [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Attribution Requirement ⚠️ IMPORTANT
**If you use this dataset in any research, application, product, or publication, you MUST provide attribution to Addis AI.**
#### Recommended Attribution Text:
```
This work uses the Wikipedia Amharic dataset, translated by Addis AI
using the Aleph (፩) translation model.
Dataset: https://huggingface.co/datasets/addisai/wikipedia-amharic
Provider: Addis AI (platform.addisassistant.com)
```
#### For Academic Papers:
```bibtex
@dataset{wikipedia_amharic_2025,
title={Wikipedia Amharic: High-Quality Amharic Translations of Wikipedia},
author={{Addis AI}},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/datasets/addisai/wikipedia-amharic}},
note={Translated using Addis AI - Aleph (፩) translation model}
}
```
#### For Software/Applications:
Include in your README, documentation, or about page:
```markdown
## Data Attribution
This application uses Wikipedia Amharic dataset by Addis AI.
- Dataset: https://huggingface.co/datasets/addisai/wikipedia-amharic
- Translation: Addis AI - Aleph (፩)
- Provider: platform.addisassistant.com
```
### Source Attribution
Original content sourced from:
- **Wikipedia**: Licensed under CC BY-SA 3.0
- **Wikimedia Foundation**: https://www.wikimedia.org
---
## Contact and Support
### Get in Touch
- 📧 **Email**: contact@addisassistant.com
- 🌐 **Website**: [https://addisassistant.com](https://addisassistant.com)
- 💬 **Issues & Discussions**: Use Hugging Face dataset discussion board
### Feedback Welcome
We value your feedback and contributions:
- 🐛 **Report Issues**: Found translation errors or problems? Let us know!
- 💡 **Suggestions**: Ideas for improvement? We're listening!
- 📊 **Use Cases**: Share how you're using the dataset
- 🤝 **Collaborate**: Interested in Ethiopian language tech? Contact us!
### Support the Project
Help us expand Amharic and Ethiopian language resources:
- ⭐ Star the dataset on Hugging Face
- 📢 Share with researchers and developers
- 🔗 Cite in your publications
- 🤝 Collaborate on future datasets
---
## Acknowledgments
### Credits
**Addis AI Team**
- Development of Aleph (፩) translation model
- Dataset creation and quality assurance
- Infrastructure and processing pipeline
**Wikimedia Foundation & Wikipedia Community**
- Original encyclopedic content
- Maintaining free knowledge infrastructure
- Open licensing enabling projects like this
**Hugging Face**
- Dataset hosting and distribution platform
- Tools and infrastructure for dataset sharing
- Community support
**Ethiopian NLP Community**
- Ongoing support and feedback
- Advocacy for Ethiopian language technology
- Collaboration and knowledge sharing
### Technology Stack
- **Translation**: Addis AI - Aleph (፩)
- **API Platform**: Google Generative AI
- **Processing**: Python with parallel processing
- **Dataset Format**: Hugging Face Datasets library
- **Storage**: Hugging Face Hub
---
## Related Resources
### Addis AI Datasets
Explore more Ethiopian language resources:
- 🗣️ Amharic conversational datasets
- 📚 Multilingual Ethiopian language collections
- 🎓 Domain-specific translated datasets
- 📖 Instruction-following datasets in Amharic
### Ethiopian Language Technology
- 🌍 [Amharic Wikipedia](https://am.wikipedia.org) - Native Amharic content
- 📝 [Ethiopian Languages on Hugging Face](https://huggingface.co/models?language=am)
- ✍️ Ge'ez script resources and tools
- 🔤 Amharic NLP tools and libraries
### Learn More
- 📖 [About Addis AI](https://addisassistant.com)
- 🤖 [Aleph (፩) Model](https://addisassistant.com/aleph)
- 📊 [Ethiopian Language Statistics](https://en.wikipedia.org/wiki/Amharic)
---
## Version History
### Version 1.0 (2025)
**Initial Release**
- ✅ Complete Wikipedia translation from November 2023 snapshot
- ✅ Professional-grade translations using Addis AI - Aleph (፩)
- ✅ Comprehensive metadata and provenance tracking
- ✅ Quality validation and error recovery
- ✅ Parallel corpus with both English and Amharic
- ✅ Full Wikipedia structure preservation
**Features**:
- High-quality neural translation
- Context-aware long-form translation
- Complete metadata tracking
- Error recovery and quality assurance
- Apache 2.0 licensing
**Statistics**:
- Thousands of articles translated
- Multiple topic areas covered
- Complete parallel corpus
- Full metadata for all articles
---
## FAQ
### General Questions
**Q: How accurate are these translations?**
A: The translations were produced with Addis AI - Aleph (፩), a neural translation model optimized for Amharic. Quality is professional-grade, though users should verify critical information with domain experts.
**Q: Can I use this for commercial purposes?**
A: Yes! The Apache 2.0 license allows commercial use. Just remember to provide attribution to Addis AI.
**Q: How recent is the content?**
A: Content reflects Wikipedia as of November 1, 2023. For current events, consult more recent sources.
**Q: Are all Wikipedia articles included?**
A: This dataset contains a substantial selection of Wikipedia articles. Not all articles may be included.
### Technical Questions
**Q: What's the file format?**
A: JSONL (JSON Lines) format, with one article per line. It loads directly with the Hugging Face `datasets` library.
**Q: How large is the dataset?**
A: The card metadata places the dataset in the 10K to 100K article range (`size_categories: 10K<n<100K`). Use streaming mode for memory-efficient processing.
**Q: Can I create my own train/test splits?**
A: Yes! The dataset is provided as a single split. Create your own splits using Hugging Face datasets methods.
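
For the JSONL format mentioned above, a downloaded file can also be read without the `datasets` library. This is a minimal standard-library sketch; `iter_jsonl` is a hypothetical helper and the file name in the usage note is a placeholder, since the card does not list the actual file names.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Typical use is to pass an open file handle, for example `with open("articles.jsonl", encoding="utf-8") as f:` followed by `for record in iter_jsonl(f): ...`. Opening with `encoding="utf-8"` matters because the Amharic text is stored in Ge'ez script.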
### Usage Questions
**Q: How do I cite this dataset?**
A: Use the BibTeX entry under "For Academic Papers" in the "License and Attribution" section above.
**Q: Can I redistribute this dataset?**
A: Yes, under Apache 2.0 terms. Include attribution and license information.
**Q: Can I modify the translations?**
A: Yes, modifications are allowed under Apache 2.0. State changes made and maintain attribution.
---
<div align="center">
## 🌟 Support Ethiopian Language Technology
This dataset is part of Addis AI's mission to democratize AI technology for Ethiopian languages and empower Amharic speakers with advanced language technology.
### What This Dataset Enables
🇪🇹 **Language Preservation**: Digital resources for Amharic language
📚 **Education**: Learning materials for millions of Amharic speakers
🤖 **AI Advancement**: Training data for Ethiopian language models
🌍 **Global Access**: Breaking language barriers for knowledge access
💡 **Innovation**: Foundation for new Ethiopian language applications
---
### How You Can Help
⭐ **Star this dataset** on Hugging Face
📢 **Share** with researchers and developers
🔗 **Cite** in your academic publications
💬 **Provide feedback** to improve future versions
🤝 **Collaborate** on Ethiopian language technology
---
**Created with ❤️ by [Addis AI](https://addisassistant.com)**
*Empowering Ethiopian Languages with Advanced AI Technology*
### Addis AI - Aleph (፩)
**The First of its Kind**
*Advanced translation technology specifically designed for Ethiopian languages, combining state-of-the-art AI with deep cultural understanding.*
---
**📧 Contact**: contact@addisassistant.com
**🌐 Website**: https://platform.addisassistant.com
**🤗 Hugging Face**: https://huggingface.co/addisai
---
**Dataset Created**: 2025
**Last Updated**: 2025
**License**: Apache 2.0
**Version**: 1.0
</div>
---
**🙏 Remember to attribute Addis AI when using this dataset!**
*ስለተጠቀሙት እናመሰግናለን! (Thank you for using our dataset!)*
Provider:
addisai



