
FriezaForce/tv2en-raw-aligned

Hugging Face | Updated 2026-03-17 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/FriezaForce/tv2en-raw-aligned
Resource description:
---
license: cc-by-4.0
task_categories:
- translation
language:
- tvl
- en
size_categories:
- 100K<n<1M
source_datasets:
- jw_org_wol
---

# TV2EN: Tuvaluan-English Parallel Corpus

## Dataset Summary

Raw aligned Tuvaluan-English parallel corpus (before cleaning)

## Dataset Details

### Data Fields

- **tvl** (string): Text in Tuvaluan
- **en** (string): Text in English
- **domain** (string): Source domain (bible, book/article, daily_text)
- **content_type** (string): Content type (bible_verse, article_paragraph, daily_text)
- **doc_id** (string, optional): Document identifier from JW.org
- **date** (string, optional): Date for daily texts (YYYY-MM-DD format)

### Data Size

- **Total pairs**: 309,700 pairs
- **Languages**: Tuvaluan (TVL) ↔ English (EN)
- **Domains**: Biblical texts, religious articles, daily devotional content

### Data Quality

**Cleaning Pipeline:**

- Removal of duplicate entries (by ID and content)
- Filtering of malformed entries
- Validation of language pair alignment
- Metadata consistency checks

**Quality Metrics:**

- Parse success rate: >99%
- Duplicate removal: 131K entries removed
- Final retention: 178K high-quality pairs

## Source

All text sourced from Watch Tower Library Online (JW.org/wol):

- **Bible**: 30,838 verse-aligned pairs across 66 books
- **Articles**: 275,430 paragraph-aligned pairs from publications
- **Daily Text**: 3,432 date-aligned devotional pairs (2017-2025)

## Language Coverage

**Tuvaluan (TVL)**

- Native speakers: ~11,000
- Classification: Low-resource Polynesian language
- WOL locale: `lp-vl` (VL, not TVL)

**English (EN)**

- Native content from JW.org English publications
- WOL locale: `lp-e`

## Licensing & Attribution

- **License**: CC-BY-4.0
- **Source**: Watch Tower Bible and Tract Society of Pennsylvania
- **Attribution**: JW.org Watch Tower Library Online (https://www.jw.org/)

## Ethical Considerations

- Content sourced from religious publications
- Reflects Watchtower theological positions
- Suitable for low-resource NLP research

## Citation

```bibtex
@dataset{tv2en_corpus,
  title={TV2EN: Tuvaluan-English Parallel Corpus},
  author={FriezaForce},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/FriezaForce/tv2en-cleaned}
}
```

## Dataset Statistics

- **Min/max sentence length**: Variable (typically 5-200 words per side)
- **Alignment quality**: High (verified by manual sampling)
- **Temporal coverage**: 2017-2025 for daily texts
- **Publication coverage**: 22+ publication codes

## Suggested Use Cases

- Machine translation (Tuvaluan ↔ English)
- Low-resource NLP research
- Multilingual model adaptation
- Cross-lingual transfer learning
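
The field schema and two of the cleaning steps named in the card (filtering of malformed entries, removal of duplicates) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the maintainers' actual pipeline: the helper names `is_well_formed` and `deduplicate` are hypothetical, the duplicate key (the whitespace-normalized `tvl`/`en` text pair) is only one possible reading of "by ID and content", and the sample records are placeholders rather than real corpus rows.

```python
import re
from datetime import date

# Fields the card lists without the "optional" marker.
REQUIRED_FIELDS = ("tvl", "en", "domain", "content_type")
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_well_formed(record):
    """Check required fields are non-empty strings and any date is a valid YYYY-MM-DD."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    raw_date = record.get("date")
    if raw_date is not None:
        if not isinstance(raw_date, str) or not DATE_PATTERN.fullmatch(raw_date):
            return False
        try:
            date.fromisoformat(raw_date)  # rejects impossible dates such as month 13
        except ValueError:
            return False
    return True


def deduplicate(records):
    """Keep the first record for each whitespace-normalized (tvl, en) pair (assumed key)."""
    seen = set()
    kept = []
    for record in records:
        key = (" ".join(record["tvl"].split()), " ".join(record["en"].split()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


# Placeholder records for illustration only; not taken from the dataset.
samples = [
    {"tvl": "tvl text A", "en": "en text A", "domain": "bible", "content_type": "bible_verse"},
    {"tvl": "tvl  text A", "en": "en text A", "domain": "bible", "content_type": "bible_verse"},
    {"tvl": "", "en": "en text B", "domain": "bible", "content_type": "bible_verse"},
    {"tvl": "tvl text C", "en": "en text C", "domain": "daily_text",
     "content_type": "daily_text", "date": "2017-13-01"},
    {"tvl": "tvl text D", "en": "en text D", "domain": "daily_text",
     "content_type": "daily_text", "date": "2017-01-01"},
]

cleaned = deduplicate([r for r in samples if is_well_formed(r)])
print(len(cleaned))  # 2 with the placeholder records above
```

Filtering before deduplicating keeps malformed rows from claiming a duplicate key, so a later well-formed copy of the same pair is still retained.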
Provider:
FriezaForce