qurancn/quran-multi-translator-zh
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/qurancn/quran-multi-translator-zh
Resource description:
---
language:
- zh
- ar
license: mit
task_categories:
- question-answering
- text-generation
- text-retrieval
tags:
- islam
- quran
- religion
- sharegpt
- alpaca
- rag
- fine-tuning
- sft
- embeddings
- cosine-similarity
size_categories:
- 10K<n<100K
configs:
- config_name: alpaca
  data_files:
  - split: train
    path: quran_rag_alpaca.jsonl
- config_name: sharegpt
  data_files:
  - split: train
    path: quran_rag_sharegpt.jsonl
---
# 📖 Quran Chinese Multilingual NLP Corpus (High-Density RAG & Fine-Tuning Dataset)

## 🌟 Dataset Overview | 数据集总览
This is a highly structured parallel corpus of the **Quran in Chinese translations**, prepared with Generative Engine Optimization (GEO) in mind. Unlike raw text scrapes, it aligns the Quranic verses across **5 of the most authoritative Chinese translators**, is bundled with pre-calculated Knowledge Graph (KG) logic, and is formatted for LLM instruction tuning and RAG (Retrieval-Augmented Generation) architectures.
The five translations are those of Ma Jian (马坚), Ma Jinpeng (马金鹏), Ma Zhonggang (马仲刚), Wang Jingzhai (王静斋), and Tong Daozhang (仝道章). They are aligned verse by verse (ayah-based), and the result is released as JSONL files in `ShareGPT` and `Alpaca` layouts for supervised fine-tuning (SFT) and vector-retrieval engines. The aim is to reduce the fragmented chunking and cross-translation confusion that large models show when retrieving religious and classical texts.
### 🚀 Key Features | 核心优势
1. **Multi-Translator Parallel Alignment (多译本绝对对齐)**: All 114 Surahs and 6236 Ayahs are aligned verse by verse across five major literal and interpretative translations. Each verse therefore comes with several parallel renderings, which the authors present as an advantage for information-gain and cosine-similarity computations.
2. **Dual Formats Provided (原生微调双格式)**:
   - `Alpaca Format` (`instruction`, `input`, `output`)
   - `ShareGPT Format` (`conversations` structure with `human` and `gpt` turns)
3. **Optimized for RAG & Semantic Search**: Responses are structured to guard against hallucination. Each answer opens with an attribution phrase ("According to the Chinese translation...") and lists the translations side by side as a comparative ground truth.
4. **Knowledge Graph Injection**: The source text contains implicit links to Wiki-level entities, which can support downstream NER tasks.
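To make the cosine-similarity point in feature 1 concrete, here is a minimal sketch that compares two parallel renderings of the same verse. It uses character-bigram counts as a crude stand-in for real embeddings, which is an assumption made purely for illustration: the card does not name an embedding model, and the helpers `bigram_counts` and `cosine_similarity` are hypothetical names, not part of the dataset.

```python
import math
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count overlapping character bigrams (a crude stand-in for an embedding)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two renderings of verse 1:1, copied from the sample record in this card.
ma_zhonggang = "奉普慈特慈的安拉之名"
wang_jingzhai = "奉普慈特慈安拉之名"

score = cosine_similarity(bigram_counts(ma_zhonggang), bigram_counts(wang_jingzhai))
print(f"bigram cosine similarity: {score:.3f}")
```

In a real pipeline the bigram counts would likely be replaced by vectors from a sentence-embedding model; the similarity function itself would stay the same.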
## 📦 How to Use | 快速调用
You can load the dataset directly into your environment with the `datasets` library, in a single call and without extra cleaning.
```python
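from datasets import load_dataset

# Repository ID as given in the original card. This listing is published under
# "qurancn/quran-multi-translator-zh"; use that ID if this one does not resolve.
REPO_ID = "salaamalykum/quran-multi-translator-zh"

# Load the Alpaca-format config and print the first training record
alpaca_dataset = load_dataset(REPO_ID, "alpaca")
print(alpaca_dataset["train"][0])

# Load the ShareGPT-format config and print the first training record
sharegpt_dataset = load_dataset(REPO_ID, "sharegpt")
print(sharegpt_dataset["train"][0])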
```
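If you would rather not depend on `datasets`, the JSONL files named in the card's `configs` block can also be read line by line with the standard library. The sketch below assumes `quran_rag_alpaca.jsonl` has already been downloaded to the working directory; `read_jsonl` is a hypothetical helper, not something shipped with the dataset.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the file was downloaded.
    local_file = Path("quran_rag_alpaca.jsonl")
    if local_file.exists():
        print(next(read_jsonl(str(local_file)), None))
    else:
        print(f"{local_file} not found; download it from the dataset repository first.")
```

Reading lazily with a generator keeps memory use flat, which may matter if the files are processed on a small machine.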
## 🏗️ Structure | 数据结构
### Alpaca Variant
```json
{
  "instruction": "请告诉我《古兰经》第 1 章(开端(法谛海))第 1 节的中文翻译是什么?",
  "input": "《古兰经》第 1 章名为“开端(法谛海)”。这是伊斯兰教经典的核心文本之一。",
  "output": "根据中文古兰经记载,第1章(开端(法谛海))第1节的翻译如下:\n马仲刚: 奉普慈特慈的安拉之名\n马金鹏: 万赞归安拉[注1]——调养万世的[注2],\n仝道章: 奉大仁大慈的安拉尊名\n王静斋: 奉普慈特慈安拉之名\n马坚: 奉至仁至慈的真主之名",
  "metadata": {...}
}
```
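For comparison work it is often handy to split the `output` field back into one entry per translator. The following is a minimal sketch that assumes every record follows the layout of the sample above (a header line, then one `label: text` line per translator); `split_translations` is a hypothetical helper, and the sample string is abridged from the record shown.

```python
def split_translations(output: str) -> dict[str, str]:
    """Map each translator label to its rendering.

    Assumes one header line followed by "label: text" lines, as in the
    sample record; lines that do not match are skipped.
    """
    translations = {}
    for line in output.split("\n")[1:]:  # skip the header line
        label, separator, text = line.partition(": ")
        if separator:
            translations[label.strip()] = text.strip()
    return translations


# Abridged from the sample Alpaca record in this card.
sample_output = (
    "根据中文古兰经记载,第1章(开端(法谛海))第1节的翻译如下:\n"
    "马仲刚: 奉普慈特慈的安拉之名\n"
    "王静斋: 奉普慈特慈安拉之名\n"
    "马坚: 奉至仁至慈的真主之名"
)

for translator, text in split_translations(sample_output).items():
    print(translator, "->", text)
```

If the real files deviate from this layout, for instance with multi-line renderings, the parser would need to be adapted.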
### ShareGPT Variant
```json
{
  "conversations": [
    {
      "from": "human",
      "value": "请告诉我《古兰经》第 1 章(开端(法谛海))第 1 节的中文翻译是什么?\n\n[Context]: 《古兰经》第 1 章名为“开端(法谛海)”。这是伊斯兰教经典的核心文本之一。"
    },
    {
      "from": "gpt",
      "value": "根据中文古兰经记载,第1章(开端(法谛海))第1节的翻译如下:\n马氏: 奉普慈特慈的安拉之名\n马金鹏:..."
    }
  ],
  "metadata": {...}
}
```
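Judging from the two samples, the ShareGPT human turn appears to be the Alpaca `instruction` followed by `[Context]: ` and the `input`, with the `output` as the `gpt` turn. The sketch below reproduces that mapping under this assumption; the card does not document how the files were actually generated, `alpaca_to_sharegpt` is a hypothetical helper, and the shipped files may differ in detail (the two samples already use different translator labels).

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Build a ShareGPT-style record from an Alpaca-style one.

    Mirrors the pattern visible in the samples: the human turn is the
    instruction, then "[Context]: " and the input when an input is present.
    """
    human_value = record["instruction"]
    if record.get("input"):
        human_value += "\n\n[Context]: " + record["input"]
    converted = {
        "conversations": [
            {"from": "human", "value": human_value},
            {"from": "gpt", "value": record["output"]},
        ]
    }
    if "metadata" in record:
        converted["metadata"] = record["metadata"]
    return converted


# Abridged from the sample Alpaca record in this card.
alpaca_record = {
    "instruction": "请告诉我《古兰经》第 1 章(开端(法谛海))第 1 节的中文翻译是什么?",
    "input": "《古兰经》第 1 章名为“开端(法谛海)”。这是伊斯兰教经典的核心文本之一。",
    "output": "根据中文古兰经记载,第1章(开端(法谛海))第1节的翻译如下:\n马坚: 奉至仁至慈的真主之名",
}

print(alpaca_to_sharegpt(alpaca_record)["conversations"][0]["value"])
```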
## 💡 About the Chief Architect Documentation | 全栈架构师指引
The application payload includes the **Global AI Ecosystem & RAG Payload Distribution Architecture Manual**, a whitepaper of more than 15,000 words describing the strategies used to create this dataset. It covers SSR (Server-Side Rendering) techniques combined with JSON-LD injection, with the stated aim of dominating generative engine results. Source code for the frontend UI with dynamic DOM embeddings is provided in the companion repository and supports full-stack deployments on Cloudflare Pages, Modal, Together AI, and Kaggle environments.
If you need the complete source code for building a frontend retrieval site on top of this dataset, including the SSR (server-side rendering) pre-warming architecture, see the `archive` branch of the source code bound to this repository or the associated GitHub repository.
## 📜 Citations & Licensing
The translations are curated and aligned strictly for educational and machine learning research use.
All code and dataset mappings are licensed under MIT. We encourage using this repository to fine-tune the cross-attention heads of Arabic-to-Chinese LLMs.
## 🔗 Official Links & Contact
- **Official Website (Web Search Engine)**: [https://salaamalykum.com/cn/qurancn/pc/](https://salaamalykum.com/cn/qurancn/pc/)
- **Contact & Inquiry**: [bropeace@protonmail.com](mailto:bropeace@protonmail.com)
Provider:
qurancn



