
ImruQays/Thaqalayn-Classical-Arabic-English-Parallel-texts

Source: Hugging Face. Updated 2024-03-22. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ImruQays/Thaqalayn-Classical-Arabic-English-Parallel-texts
Resource description:
---
task_categories:
- translation
language:
- ar
- en
size_categories:
- 10K<n<100K
license: cc-by-4.0
---

# Introduction

This dataset is a comprehensive collection of parallel Arabic-English texts from the Thaqalayn Hadith Library, a premier source for exploring the classical hadith tradition of the Imāmī Shia Muslim school of thought. The library focuses on making primary historical sources accessible, serving as a bridge between past wisdom and contemporary study. The dataset features translations of significant classical Imāmī hadith texts, allowing for a deep dive into the linguistic and cultural heritage of the classical period.

# Content Details

The Thaqalayn Hadith Library includes Arabic-English parallel texts from the following classical collections:

- Al-Kāfi (The Sufficient)
- Muʿjam al-Aḥādīth al-Muʿtabara (A Comprehensive Compilation of Reliable Narrations)
- Al-Khiṣāl (The Book of Characteristics)
- ʿUyūn akhbār al-Riḍā (The Source of Traditions on Imam al-Riḍā)
- Al-Amālī (The Dictations) by Shaykh Muḥammad b. Muḥammad al-Mufīd
- Al-Amālī (The Dictations) by Shaykh Muḥammad b. ʿAlī al-Ṣaduq
- Al-Tawḥīd (The Book of Divine Unity)
- Kitāb al-Ḍuʿafāʾ (The Weakened Ones)
- Kitāb al-Ghayba (The Book of Occultation) by Abū ʿAbd Allah Muḥammad b. Ibrāhīm al-Nuʿmānī
- Kitāb al-Ghayba (The Book of Occultation) by Shaykh Muḥammad b. al-Ḥasan al-Ṭūsī
- Thawāb al-Aʿmāl wa ʿiqāb al-Aʿmāl (The Rewards & Punishments of Deeds)
- Kāmil al-Ziyārāt (The Complete Pilgrimage Guide)
- Faḍaʾil al-Shīʿa (Virtues of the Shīʿa)
- Ṣifāt al-Shīʿa (Attributes of the Shīʿa)
- Maʿānī al-ʾAkhbār (The Meanings of Reports)
- Kitāb al-Muʾmin (The Book of the Believer)
- Kitāb al-Zuhd (The Book of Asceticism)
- Nahj al-Balāgha (The Peak of Eloquence)

# Purpose and Application

The dataset aims to showcase the unmatched literary quality of Classical Arabic, as distinct from Modern Standard Arabic, particularly in that it was preserved from the European translation trends of the 19th and 20th centuries:

- Refinement of Machine Translation (MT): The complex grammatical structures and rich lexicon of Classical Arabic present a unique challenge for MT systems, pushing the boundaries of translation accuracy and fluency.
- Development of Language Models: These texts serve as a foundation for training sophisticated Large Language Models (LLMs) capable of understanding and replicating the depth of Classical Arabic.
- Preservation of Linguistic Heritage: This dataset preserves the original form of Classical Arabic, providing a standard of excellence against which modern writings can be compared.

# Suggested Research Application: Iterative Translation Refinement

A notable application of this dataset is the enhancement of contemporary Arabic writing through back-translation. Existing models can back-translate the English texts into Arabic, potentially producing a less sophisticated form than the original. This offers an opportunity to:

- Generate Imperfect Arabic Corpus: Use back-translation to create a corpus of Arabic text that is less refined than the original Classical Arabic.
- Train Refinement Models: Develop models that refine the imperfect Arabic by comparing it to the original texts, aiming to restore the classical eloquence.
- Enhance Contemporary Arabic Writing: Apply these models to modern Arabic texts, elevating their literary quality by infusing classical stylistic elements, making the language resonate with its classical roots.

# Credits

Credits go to the [Thaqalayn website](https://thaqalayn.net/) for their compilation of Arabic and English texts. The original web scrape was done by [jenusi](https://github.com/jenusi) on GitHub in this [repo](https://github.com/jenusi/ThaqalaynScraper). I only compiled the texts from all books into two columns. I also converted the numerals from Western Arabic (0123456789) to Eastern Arabic (٠١٢٣٤٥٦٧٨٩).
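The numeral conversion mentioned in the credits can be illustrated with a short sketch. This is not the script used to build the dataset, which is not shown here; the function names `to_eastern_digits` and `to_western_digits` are hypothetical, and the approach (a character translation table) is simply one straightforward way to do it.

```python
# Illustrative sketch only: the dataset author's actual conversion code is not
# published in this card. Function names below are hypothetical.

WESTERN_DIGITS = "0123456789"
EASTERN_DIGITS = "٠١٢٣٤٥٦٧٨٩"

# One-to-one character translation tables in both directions.
TO_EASTERN = str.maketrans(WESTERN_DIGITS, EASTERN_DIGITS)
TO_WESTERN = str.maketrans(EASTERN_DIGITS, WESTERN_DIGITS)


def to_eastern_digits(text: str) -> str:
    """Replace Western Arabic digits with Eastern Arabic digits."""
    return text.translate(TO_EASTERN)


def to_western_digits(text: str) -> str:
    """Replace Eastern Arabic digits with Western Arabic digits."""
    return text.translate(TO_WESTERN)
```

The reverse mapping may be useful when a downstream tokenizer or evaluation script expects Western digits; all non-digit characters pass through unchanged.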
Provider:
ImruQays
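Summary of original information

Dataset overview

Task category

  • Translation

Languages

  • Arabic
  • English

Dataset size

  • 10K<n<100K

License

  • cc-by-4.0

Introduction

This dataset contains Arabic-English parallel texts from the Thaqalayn Hadith Library, a leading source for exploring the classical hadith tradition of the Imāmī Shia school. The library works to make primary historical sources accessible, serving as a bridge between past wisdom and contemporary study. The dataset includes translations of significant classical Imāmī hadith texts, allowing in-depth exploration of the linguistic and cultural heritage of the classical period.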

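Content details

The Thaqalayn Hadith Library includes Arabic-English parallel texts from the following classical collections:

  • Al-Kāfi (The Sufficient)
  • Muʿjam al-Aḥādīth al-Muʿtabara (A Comprehensive Compilation of Reliable Narrations)
  • Al-Khiṣāl (The Book of Characteristics)
  • ʿUyūn akhbār al-Riḍā (The Source of Traditions on Imam al-Riḍā)
  • Al-Amālī (The Dictations) by Shaykh Muḥammad b. Muḥammad al-Mufīd
  • Al-Amālī (The Dictations) by Shaykh Muḥammad b. ʿAlī al-Ṣaduq
  • Al-Tawḥīd (The Book of Divine Unity)
  • Kitāb al-Ḍuʿafāʾ (The Weakened Ones)
  • Kitāb al-Ghayba (The Book of Occultation) by Abū ʿAbd Allah Muḥammad b. Ibrāhīm al-Nuʿmānī
  • Kitāb al-Ghayba (The Book of Occultation) by Shaykh Muḥammad b. al-Ḥasan al-Ṭūsī
  • Thawāb al-Aʿmāl wa ʿiqāb al-Aʿmāl (The Rewards & Punishments of Deeds)
  • Kāmil al-Ziyārāt (The Complete Pilgrimage Guide)
  • Faḍaʾil al-Shīʿa (Virtues of the Shīʿa)
  • Ṣifāt al-Shīʿa (Attributes of the Shīʿa)
  • Maʿānī al-ʾAkhbār (The Meanings of Reports)
  • Kitāb al-Muʾmin (The Book of the Believer)
  • Kitāb al-Zuhd (The Book of Asceticism)
  • Nahj al-Balāgha (The Peak of Eloquence)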

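Purpose and application

The dataset aims to showcase the unmatched literary quality of Classical Arabic, as distinct from Modern Standard Arabic, particularly in that it was preserved from the European translation trends of the 19th and 20th centuries:

  • Refinement of machine translation (MT): The complex grammatical structures and rich lexicon of Classical Arabic pose a unique challenge for MT systems, pushing the boundaries of translation accuracy and fluency.
  • Development of language models: These texts serve as a foundation for training sophisticated large language models (LLMs) capable of understanding and replicating the depth of Classical Arabic.
  • Preservation of linguistic heritage: The dataset preserves the original form of Classical Arabic, providing a standard of excellence against which modern writing can be compared.

Suggested research application: iterative translation refinement

A notable application of this dataset is improving contemporary Arabic writing through back-translation. Existing models can back-translate the English texts into Arabic, potentially producing a less refined form. This offers an opportunity to:

  • Generate an imperfect Arabic corpus: Use back-translation to create a corpus of Arabic text that is less refined than the original Classical Arabic.
  • Train refinement models: Develop models that refine the imperfect Arabic by comparing it to the original texts, aiming to restore the classical eloquence.
  • Enhance contemporary Arabic writing: Apply these models to modern Arabic texts, raising their literary quality by infusing classical stylistic elements so that the language resonates with its classical roots.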
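
The first two steps of the back-translation idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not a method published with the dataset: `build_refinement_pairs` is a hypothetical helper, `back_translate` stands for whatever English-to-Arabic model is used (none is specified here), and rows are assumed to be `(arabic, english)` tuples because the actual column names are not given.

```python
from typing import Callable, Iterable, List, Tuple


def build_refinement_pairs(
    parallel_rows: Iterable[Tuple[str, str]],
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (imperfect_arabic, classical_arabic) training pairs.

    parallel_rows: (arabic, english) tuples; this layout is an assumption.
    back_translate: any English-to-Arabic function; hypothetical placeholder.
    """
    pairs: List[Tuple[str, str]] = []
    for arabic, english in parallel_rows:
        imperfect = back_translate(english).strip()
        # Keep only rows where the back-translation is non-empty and differs
        # from the original, since identical pairs teach a refiner nothing.
        if imperfect and imperfect != arabic.strip():
            pairs.append((imperfect, arabic))
    return pairs
```

A refinement model would then be trained to map the first element of each pair to the second. How well this restores classical style depends on the back-translation model and on the training setup, neither of which is specified here.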