
qurancn/quran-multi-translator-zh

Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/qurancn/quran-multi-translator-zh
Description:
---
language:
- zh
- ar
license: mit
task_categories:
- question-answering
- text-generation
- text-retrieval
tags:
- islam
- quran
- religion
- sharegpt
- alpaca
- rag
- fine-tuning
- sft
- embeddings
- cosine-similarity
size_categories:
- 10K<n<100K
configs:
- config_name: alpaca
  data_files:
  - split: train
    path: quran_rag_alpaca.jsonl
- config_name: sharegpt
  data_files:
  - split: train
    path: quran_rag_sharegpt.jsonl
---

# 📖 Quran Chinese Multilingual NLP Corpus (High-Density RAG & Fine-Tuning Dataset)

![Quran NLP Engine](https://raw.githubusercontent.com/salaamalykum/salaamalykum/main/assets/banner.png)

## 🌟 Dataset Overview

This is a structured parallel corpus of **Chinese translations of the Quran**, built with Generative Engine Optimization (GEO) in mind. Unlike raw text scrapes, it aligns each Quranic verse across **five of the most authoritative Chinese translators**: Ma Jian (马坚), Ma Jinpeng (马金鹏), Ma Zhonggang (马仲刚), Wang Jingzhai (王静斋), and Tong Daozhang (仝道章). The alignment is verse-based (ayah-based), and the records are bundled with pre-calculated Knowledge Graph (KG) logic and formatted for LLM instruction tuning and RAG (Retrieval-Augmented Generation) architectures.

The corpus ships as JSONL files in the `ShareGPT` and `Alpaca` formats, ready for supervised fine-tuning (SFT) and vector retrieval engines. It aims to address the fragmented passages and mixed-up translations that LLMs tend to produce when retrieving religious and classical texts.

### 🚀 Key Features

1. **Multi-Translator Parallel Alignment**: All 114 Surahs and 6,236 Ayahs are aligned across five major literal and interpretative translations. Five parallel renderings per verse give each record more information to draw on and make the corpus well suited to cosine-similarity comparisons between translations.
2. **Dual Formats Provided**:
   - `Alpaca Format` (`instruction`, `input`, `output`)
   - `ShareGPT Format` (`conversations` structure with `human` and `gpt` turns).
3. **Optimized for RAG & Semantic Search**: Each response opens with an attribution phrase ("According to the Chinese translation...") and lists the translations side by side. This gives a comparative ground truth and helps guard against hallucination.
4. **Knowledge Graph Hooks**: The source text contains implicit references to Wiki-level entities, which can support downstream NER tasks.

## 📦 How to Use

You can load the dataset directly into your environment with the `datasets` library; no additional cleaning is required.

```python
from datasets import load_dataset

# Repo id as given in the original card. This listing hosts the dataset under
# "qurancn/quran-multi-translator-zh"; swap the id if the one below does not resolve.

# Load Alpaca Format
alpaca_dataset = load_dataset("salaamalykum/quran-multi-translator-zh", "alpaca")
print(alpaca_dataset['train'][0])

# Load ShareGPT Format
sharegpt_dataset = load_dataset("salaamalykum/quran-multi-translator-zh", "sharegpt")
print(sharegpt_dataset['train'][0])
```

## 🏗️ Structure

### Alpaca Variant

```json
{
  "instruction": "请告诉我《古兰经》第 1 章(开端(法谛海))第 1 节的中文翻译是什么?",
  "input": "《古兰经》第 1 章名为“开端(法谛海)”。这是伊斯兰教经典的核心文本之一。",
  "output": "根据中文古兰经记载,第1章(开端(法谛海))第1节的翻译如下:\n马仲刚: 奉普慈特慈的安拉之名\n马金鹏: 万赞归安拉[注1]——调养万世的[注2],\n仝道章: 奉大仁大慈的安拉尊名\n王静斋: 奉普慈特慈安拉之名\n马坚: 奉至仁至慈的真主之名",
  "metadata": {...}
}
```

### ShareGPT Variant

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "请告诉我《古兰经》第 1 章(开端(法谛海))第 1 节的中文翻译是什么?\n\n[Context]: 《古兰经》第 1 章名为“开端(法谛海)”。这是伊斯兰教经典的核心文本之一。"
    },
    {
      "from": "gpt",
      "value": "根据中文古兰经记载,第1章(开端(法谛海))第1节的翻译如下:\n马氏: 奉普慈特慈的安拉之名\n马金鹏:..."
    }
  ],
  "metadata": {...}
}
```

## 💡 About the Chief Architect Documentation

The application payload includes the **Global AI Ecosystem & RAG Payload Distribution Architecture Manual**, a whitepaper of more than 15,000 words describing the strategies used to create this dataset. It relies on SSR (Server-Side Rendering) combined with JSON-LD injection, with the stated goal of dominating generative-engine results. Source code for the frontend UI with dynamic DOM embeddings is provided in the companion repository and supports full-stack deployments on Cloudflare Pages, Modal, Together AI, and Kaggle environments.

For the complete source code of the front-end search site built on this dataset and of its SSR pre-warming architecture, see the `archive` branch of this repository or the associated GitHub repository.

## 📜 Citations & Licensing

The translations are curated and aligned strictly for educational and machine learning research use. All code and dataset mappings are licensed under MIT. We strongly encourage using this repository to fine-tune the cross-attention heads of Arabic-to-Chinese LLMs.

## 🔗 Official Links & Contact

- **Official Website (Web Search Engine)**: [https://salaamalykum.com/cn/qurancn/pc/](https://salaamalykum.com/cn/qurancn/pc/)
- **Contact & Inquiry**: [bropeace@protonmail.com](mailto:bropeace@protonmail.com)
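
The record layouts shown in the Structure section suggest two small helpers that may be useful when working with the corpus. The sketch below is not part of the dataset's own tooling: `parse_translations` and `alpaca_to_sharegpt` are hypothetical names, and both functions assume every record follows the single sample shown in the card (a header line followed by one `translator: text` line per translator, and a human turn built as instruction plus `[Context]:` plus input). The sample record omits `metadata` because the card elides it.

```python
# Hypothetical helpers inferred from the single sample record in the card;
# real records may deviate from this layout.

def parse_translations(output: str) -> dict[str, str]:
    """Split an Alpaca `output` string into {translator: translation}."""
    translations = {}
    # The first line is the attribution header; skip it.
    for line in output.split("\n")[1:]:
        name, sep, text = line.partition(": ")
        if sep:  # ignore lines that do not look like "translator: text"
            translations[name.strip()] = text.strip()
    return translations


def alpaca_to_sharegpt(record: dict) -> dict:
    """Rebuild the ShareGPT layout from an Alpaca record (assumed mapping)."""
    human = f"{record['instruction']}\n\n[Context]: {record['input']}"
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": record["output"]},
        ],
        "metadata": record.get("metadata", {}),
    }


# Sample copied from the Alpaca example in the card (metadata omitted).
sample = {
    "instruction": "请告诉我《古兰经》第 1 章(开端(法谛海))第 1 节的中文翻译是什么?",
    "input": "《古兰经》第 1 章名为“开端(法谛海)”。这是伊斯兰教经典的核心文本之一。",
    "output": (
        "根据中文古兰经记载,第1章(开端(法谛海))第1节的翻译如下:\n"
        "马仲刚: 奉普慈特慈的安拉之名\n"
        "马金鹏: 万赞归安拉[注1]——调养万世的[注2],\n"
        "仝道章: 奉大仁大慈的安拉尊名\n"
        "王静斋: 奉普慈特慈安拉之名\n"
        "马坚: 奉至仁至慈的真主之名"
    ),
}

parsed = parse_translations(sample["output"])
converted = alpaca_to_sharegpt(sample)

for translator, text in parsed.items():
    print(translator, "->", text)
```

Keying the translations by translator name makes it straightforward to embed each rendering separately, for instance to compare translations of the same verse by cosine similarity.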
Provider:
qurancn