anke01/uyghur-dictionary-dataset

Name: anke01/uyghur-dictionary-dataset
Creator: anke01
Published: 2026-03-01 08:57:15
License: No description available

Source: Hugging Face | Updated: 2026-03-01 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/anke01/uyghur-dictionary-dataset


Description:

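---
language:
- ug
- zh
- en
license: apache-2.0
library_name: datasets
tags:
- translation
- dictionary
- uyghur
- chinese
- english
- multilingual
task_categories:
- translation
size_categories:
- 1M<n<10M
---

# Uyghur Multilingual Dictionary Dataset

A Uyghur-Chinese-English multilingual dictionary dataset, suited to fine-tuning large language models (LLMs).

## Dataset Statistics

| Dataset | Entries | Size | Language direction |
|--------|--------|------|----------|
| `ug-cn.jsonl` | 3,079,016 | 456 MB | Uyghur ⟷ Chinese |
| `ug-en.jsonl` | 685,836 | 108 MB | Uyghur ⟷ English |
| `en-ug.jsonl` | 705,534 | 112 MB | English ⟷ Uyghur |
| `ug-ug.jsonl` | 139,920 | 46 MB | Uyghur definitions |
| `cn-cn.jsonl` | 127,646 | 39 MB | Chinese definitions |

**Total**: 4,737,952 entries (bidirectional) / approximately 2.36 million unique pairs

## Data Format

```json
{
  "instruction": "请翻译以下维吾尔语词汇",
  "input": "مەركىزى",
  "output": "中心的，中央的"
}
```

In this sample record, the instruction reads "Please translate the following Uyghur word" and the output means "central".

Field descriptions:

- `instruction`: the task instruction
- `input`: the input text
- `output`: the output text

## Usage Example

```python
import json

# Each line of the JSONL file is one standalone JSON record.
with open('uyghur_data/output/ug-cn.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line)
        print(f"Input: {data['input']}")
        print(f"Output: {data['output']}")
```

## Data Sources

- Gheyret dictionary (~370,000 entries)
- Tilkan dictionary database (150 MB)
- Bilkan dictionary (5 directions)
- Online Dictionary (~330,000 entries)

## File Locations

```
uyghur_data/output/
├── ug-cn.jsonl    # Uyghur-Chinese dataset
├── ug-en.jsonl    # Uyghur-English dataset
├── en-ug.jsonl    # English-Uyghur dataset
├── ug-ug.jsonl    # Uyghur-Uyghur definitions
└── cn-cn.jsonl    # Chinese-Chinese definitions
```

---

**Version**: 1.0 | **Updated**: 2025-03-01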
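Since the dataset card describes the records as intended for LLM fine-tuning, the following is a minimal sketch of how the `instruction` / `input` / `output` fields might be turned into prompt and target pairs. The helper names `iter_records` and `to_prompt_pair`, the newline-joined prompt layout, and the choice to skip malformed lines are all assumptions made for illustration; the dataset card does not prescribe any of them. The sample lines are in-memory stand-ins, with only the first taken from the card's sample record.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield well-formed records from JSONL lines, skipping blank or invalid ones."""
    required = ("instruction", "input", "output")
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: skip malformed lines instead of aborting
        if isinstance(record, dict) and all(
            isinstance(record.get(key), str) for key in required
        ):
            yield record


def to_prompt_pair(record: dict) -> tuple[str, str]:
    """Join instruction and input into one prompt; the output is the target."""
    prompt = f"{record['instruction']}\n{record['input']}"
    return prompt, record["output"]


# The first line is the sample record from the dataset card; the rest are
# deliberately malformed to exercise the filter.
sample_lines = [
    '{"instruction": "请翻译以下维吾尔语词汇", "input": "مەركىزى", "output": "中心的，中央的"}',
    "",
    "not json",
    '{"instruction": "missing fields"}',
]

pairs = [to_prompt_pair(record) for record in iter_records(sample_lines)]
```

To run this against a real file, an open file handle can be passed in place of `sample_lines`, for example `iter_records(open('uyghur_data/output/ug-cn.jsonl', encoding='utf-8'))`. Iterating lazily keeps memory use flat, which matters for a file with about three million lines.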

Provider:

anke01
