FriezaForce/tv2en-raw-aligned
Source: Hugging Face | Updated: 2026-03-17 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/FriezaForce/tv2en-raw-aligned
Resource description:
---
license: cc-by-4.0
task_categories:
- translation
language:
- tvl
- en
size_categories:
- 100K<n<1M
source_datasets:
- jw_org_wol
---
# TV2EN: Tuvaluan-English Parallel Corpus
## Dataset Summary
Raw, aligned Tuvaluan-English parallel corpus, as collected before cleaning.
## Dataset Details
### Data Fields
- **tvl** (string): Text in Tuvaluan
- **en** (string): Text in English
- **domain** (string): Source domain (bible, book/article, daily_text)
- **content_type** (string): Content type (bible_verse, article_paragraph, daily_text)
- **doc_id** (string, optional): Document identifier from JW.org
- **date** (string, optional): Date for daily texts (YYYY-MM-DD format)
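As a minimal sketch of how the fields above might be checked on a single record, the function below tests that the four required columns are non-empty strings and that `date`, when present, is a real calendar date in YYYY-MM-DD form. The function name `validate_row` and the placeholder values are illustrative assumptions, not part of the dataset tooling.

```python
import datetime
import re

REQUIRED_FIELDS = ("tvl", "en", "domain", "content_type")
OPTIONAL_FIELDS = ("doc_id", "date")
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_row(row):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = row.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {name}")
    for name in OPTIONAL_FIELDS:
        value = row.get(name)
        if value is not None and not isinstance(value, str):
            problems.append(f"optional field is not a string: {name}")
    day = row.get("date")
    if isinstance(day, str):
        try:
            # The regex enforces the exact shape; fromisoformat rejects
            # impossible dates such as 2024-02-30.
            if not DATE_PATTERN.fullmatch(day):
                raise ValueError(day)
            datetime.date.fromisoformat(day)
        except ValueError:
            problems.append(f"date is not a valid YYYY-MM-DD value: {day!r}")
    return problems


# Placeholder record; the text values are stand-ins, not real corpus content.
example = {
    "tvl": "<Tuvaluan text>",
    "en": "<English text>",
    "domain": "daily_text",
    "content_type": "daily_text",
    "date": "2024-02-29",
}
print(validate_row(example))
```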
### Data Size
- **Total pairs**: 309,700 pairs
- **Languages**: Tuvaluan (TVL) ↔ English (EN)
- **Domains**: Biblical texts, religious articles, daily devotional content
### Data Quality
**Cleaning Pipeline:**
- Removal of duplicate entries (by ID and content)
- Filtering of malformed entries
- Validation of language pair alignment
- Metadata consistency checks
**Quality Metrics:**
- Parse success rate: >99%
- Duplicate removal: ~131K entries removed
- Final retention: ~178K high-quality pairs (out of the 309,700 raw pairs)
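The cleaning steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: the card does not say which ID drives deduplication, so the sketch assumes a hypothetical per-entry key named `entry_id`, and it treats a row as malformed when either side is empty.

```python
def clean_pairs(rows):
    """Drop malformed rows and duplicates, keeping first occurrences in order."""
    seen_ids = set()
    seen_content = set()
    kept = []
    for row in rows:
        tvl = (row.get("tvl") or "").strip()
        en = (row.get("en") or "").strip()
        if not tvl or not en:
            continue  # malformed: one side of the pair is missing
        entry_id = row.get("entry_id")  # hypothetical field, see lead-in
        if entry_id is not None and entry_id in seen_ids:
            continue  # duplicate by ID
        if (tvl, en) in seen_content:
            continue  # duplicate by content
        if entry_id is not None:
            seen_ids.add(entry_id)
        seen_content.add((tvl, en))
        kept.append(row)
    return kept
```

Keeping the first occurrence preserves the original ordering, which makes the result easy to diff against the raw file.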
## Source
All text sourced from Watch Tower Library Online (JW.org/wol):
- **Bible**: 30,838 verse-aligned pairs across 66 books
- **Articles**: 275,430 paragraph-aligned pairs from publications
- **Daily Text**: 3,432 date-aligned devotional pairs (2017-2025)
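As a quick consistency check, the three per-source counts above add up to the 309,700 total reported under Data Size. The short script below recomputes the sum and each source's share; the dictionary keys are illustrative labels only.

```python
# Pair counts as stated in the Source section of this card.
source_counts = {
    "bible": 30_838,
    "articles": 275_430,
    "daily_text": 3_432,
}

total = sum(source_counts.values())
print(f"total: {total:,} pairs")

for name, count in source_counts.items():
    print(f"{name}: {count:,} pairs ({count / total:.1%} of the corpus)")
```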
## Language Coverage
**Tuvaluan (TVL)**
- Native speakers: ~11,000
- Classification: Low-resource Polynesian language
- WOL locale: `lp-vl` (WOL uses the code VL, not the ISO 639-3 code TVL)
**English (EN)**
- Native content from JW.org English publications
- WOL locale: `lp-e`
## Licensing & Attribution
- **License**: CC-BY-4.0
- **Source**: Watch Tower Bible and Tract Society of Pennsylvania
- **Attribution**: JW.org Watch Tower Library Online (https://www.jw.org/)
## Ethical Considerations
- Content sourced from religious publications
- Reflects Watchtower theological positions
- Suitable for low-resource NLP research
## Citation
```bibtex
@dataset{tv2en_corpus,
title={TV2EN: Tuvaluan-English Parallel Corpus},
author={FriezaForce},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/FriezaForce/tv2en-cleaned}
}
```
## Dataset Statistics
- **Min/max sentence length**: Variable (typically 5-200 words per side)
- **Alignment quality**: High (verified by manual sampling)
- **Temporal coverage**: 2017-2025 for daily texts
- **Publication coverage**: 22+ publication codes
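One way to check the length figures above on a local copy is sketched below. It assumes whitespace-separated word counts, which is an approximation, and the helper name `length_summary` and its default bounds of 5 and 200 are illustrative choices taken from the "typical" range quoted above.

```python
def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def length_summary(rows, low=5, high=200):
    """Report min/max words per side and how many pairs fall outside [low, high]."""
    if not rows:
        raise ValueError("rows must not be empty")
    counts = [(word_count(r["tvl"]), word_count(r["en"])) for r in rows]
    flat = [n for pair in counts for n in pair]
    outside = sum(
        1 for t, e in counts
        if not (low <= t <= high and low <= e <= high)
    )
    return {"min": min(flat), "max": max(flat), "outside_typical": outside}
```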
## Suggested Use Cases
- Machine translation (Tuvaluan ↔ English)
- Low-resource NLP research
- Multilingual model adaptation
- Cross-lingual transfer learning
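For the machine translation use case, a minimal sketch of turning records into training examples in both directions might look like this. The output keys (`src_lang`, `tgt_lang`, `source`, `target`) and the function name are assumptions for illustration; adapt them to whatever format your training framework expects.

```python
def to_translation_examples(rows, domains=None):
    """Yield one example per direction for each pair, optionally filtered by domain."""
    for row in rows:
        if domains is not None and row["domain"] not in domains:
            continue
        yield {"src_lang": "tvl", "tgt_lang": "en",
               "source": row["tvl"], "target": row["en"]}
        yield {"src_lang": "en", "tgt_lang": "tvl",
               "source": row["en"], "target": row["tvl"]}


# Placeholder rows; the text values are stand-ins, not real corpus content.
sample_rows = [
    {"tvl": "<Tuvaluan verse>", "en": "<English verse>", "domain": "bible"},
    {"tvl": "<Tuvaluan text>", "en": "<English text>", "domain": "daily_text"},
]
bible_only = list(to_translation_examples(sample_rows, domains={"bible"}))
print(len(bible_only))
```

Because the corpus is dominated by religious text, filtering by `domain` like this also makes it straightforward to hold out one domain for evaluation.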
Provider:
FriezaForce



