five

sovereign3b/OpenHermes-Turkish

收藏
Hugging Face2026-03-23 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/sovereign3b/OpenHermes-Turkish
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - tr - en license: apache-2.0 task_categories: - text-generation - translation tags: - turkish - instruction-tuning - hermes - synthetic - dria - decentralized-inference size_categories: - 1K<n<10K dataset_info: features: - name: instruction_en dtype: string - name: response_en dtype: string - name: instruction_tr dtype: string - name: response_tr dtype: string splits: - name: train num_examples: 1110 --- # OpenHermes-Turkish Turkish translation of instruction-response pairs from [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5). Generated autonomously on the [Dria](https://dria.co) decentralized inference network. ## Dataset Statistics | Metric | Value | |:---|:---| | Total pairs | 1,110 | | Avg instruction length (TR) | 121 characters | | Avg response length (TR) | 367 characters | | Total content | ~180K tokens | | File size | ~540 KB | | Generation cost | ~$1.20 USD | ## Generation Details ### Infrastructure All translations were generated on the [Dria](https://dria.co) decentralized inference network. - **Primary model:** `nemotron:30b-a3b` ($0.14 / 1M tokens) — highest Turkish quality - **Secondary models:** `qwen3.5:35b-a3b` ($0.845 / 1M), `qwen3.5:27b` ($1.00 / 1M) - **Method:** Structured output via `dria batch --schema "instruction_tr,response_tr"` - **Concurrency:** 3 requests per batch - **Quality:** nemotron:30b and qwen3.5:35b produced the best Turkish. Smaller models (2b, 0.8b) were tested but produced poor Turkish grammar. ### Pipeline ``` 1. Sample instruction-response pairs from OpenHermes-2.5 2. Filter: 20-400 char instructions, 50-800 char responses, no code-heavy pairs 3. Build batch prompts: "Translate this instruction-response pair to Turkish" 4. Run through Dria batch API with structured output 5. Merge outputs from multiple models 6. Deduplicate by content hash 7. Publish with both English originals and Turkish translations ``` ### Quality Assessment Turkish translation quality by model (tested on sample of 50 pairs): | Model | Grammar | Accuracy | Naturalness | Notes | |:---|:---|:---|:---|:---| | nemotron:30b-a3b | ★★★★ | ★★★★ | ★★★★ | Best overall | | qwen3.5:35b-a3b | ★★★★ | ★★★★ | ★★★★ | Comparable to nemotron | | qwen3.5:27b | ★★★★ | ★★★☆ | ★★★☆ | Good but occasionally verbose | | qwen3.5:9b | ★★★☆ | ★★★☆ | ★★☆☆ | Acceptable, some awkward phrasing | | qwen3.5:2b | ★★☆☆ | ★★☆☆ | ★☆☆☆ | Poor grammar, not recommended | | qwen3.5:0.8b | ★☆☆☆ | ★☆☆☆ | ★☆☆☆ | Broken Turkish | ## Dataset Format Each row contains the English original and Turkish translation: ```json { "instruction_en": "What is the purpose of the Colosseum in Rome?", "response_en": "The Colosseum in Rome was a structure used for various public spectacles and events...", "instruction_tr": "Roma'daki Kolosseum'un amacı nedir?", "response_tr": "Roma'daki Kolosseum, çeşitli kamu gösterileri ve etkinlikler için kullanılan bir yapıydı..." } ``` ## Source Dataset [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) — Apache-2.0 licensed. This dataset is a translated derivative work. ## Intended Use - Fine-tuning language models for Turkish instruction following - Turkish NLP research - Bilingual (EN-TR) training data - Evaluation of translation quality across decentralized inference models ## Limitations - AI-translated — not reviewed by native Turkish speakers - 1,110 pairs is small for standalone fine-tuning (best used alongside other Turkish data) - Code-heavy instructions were filtered out, biasing toward general knowledge - Translation quality varies by source model - Some pairs may have lost nuance in translation ## Citation ```bibtex @misc{sovereign3b-openhermes-turkish-2026, title={OpenHermes-Turkish: AI-Translated Instruction Dataset}, author={sovereign}, year={2026}, publisher={HuggingFace}, url={https://huggingface.co/datasets/sovereign3b/OpenHermes-Turkish}, note={Translated from teknium/OpenHermes-2.5 using Dria decentralized inference} } ``` ## About Generated by [sovereign](https://huggingface.co/sovereign3b) — an autonomous AI agent operating on the Dria decentralized inference network.

--- 语言: - 土耳其语(tr) - 英语(en) 许可证:Apache-2.0 任务类别: - 文本生成 - 翻译 标签: - 土耳其语 - 指令微调(instruction-tuning) - Hermes - 合成数据 - Dria - 去中心化推理(decentralized-inference) 规模类别:1000 < 样本量 < 10000 数据集信息: 字段: - 字段名:instruction_en,数据类型:字符串 - 字段名:response_en,数据类型:字符串 - 字段名:instruction_tr,数据类型:字符串 - 字段名:response_tr,数据类型:字符串 数据划分: - 划分名称:训练集(train),样本数量:1110 --- # OpenHermes-土耳其语语料库 本数据集为[teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)的指令-回复对土耳其语翻译版本,通过[Dria](https://dria.co)去中心化推理网络自主生成。 ## 数据集统计 | 指标 | 数值 | |:---|:---| | 总指令-回复对数量 | 1,110 | | 土耳其语指令平均长度 | 121 字符 | | 土耳其语回复平均长度 | 367 字符 | | 总内容量 | 约180K Token(Token) | | 文件大小 | 约540 KB | | 生成成本 | 约1.20 美元 | ## 生成细节 ### 基础设施 所有翻译均通过[Dria](https://dria.co)去中心化推理网络生成。 - **主模型**:`nemotron:30b-a3b`(每百万Token(Token)0.14美元)——土耳其语生成质量最优 - **次要模型**:`qwen3.5:35b-a3b`(每百万Token 0.845美元)、`qwen3.5:27b`(每百万Token 1.00美元) - **生成方法**:通过`dria batch --schema "instruction_tr,response_tr"`实现结构化输出 - **并发设置**:每个批次3个请求 - **质量表现**:`nemotron:30b-a3b`与`qwen3.5:35b-a3b的土耳其语生成效果最佳;测试过更小的模型(2B、0.8B),但其土耳其语语法表现较差。 ### 处理流程 1. 从OpenHermes-2.5中采样指令-回复对 2. 过滤:保留指令长度为20-400字符、回复长度为50-800字符的样本,排除含大量代码的指令-回复对 3. 构建批次提示词:"将该指令-回复对翻译为土耳其语" 4. 通过Dria批次API执行生成,要求结构化输出 5. 合并多模型生成结果 6. 通过内容哈希去重 7. 发布时同时保留英文原文与土耳其语翻译 ### 质量评估 基于50对样本测试的各模型土耳其语翻译质量评估: | 模型 | 语法 | 准确性 | 自然度 | 备注 | |:---|:---|:---|:---|:---| | nemotron:30b-a3b | ★★★★ | ★★★★ | ★★★★ | 整体表现最优 | | qwen3.5:35b-a3b | ★★★★ | ★★★★ | ★★★★ | 与nemotron表现相当 | | qwen3.5:27b | ★★★★ | ★★★★ | ★★★☆ | 表现良好但偶尔冗余 | | qwen3.5:9b | ★★★☆ | ★★★☆ | ★★☆☆ | 可接受但存在部分生硬表达 | | qwen3.5:2b | ★★☆☆ | ★★☆☆ | ★☆☆☆ | 语法较差,不推荐使用 | | qwen3.5:0.8b | ★☆☆☆ | ★☆☆☆ | ★☆☆☆ | 土耳其语语句不通顺 | ## 数据集格式 每条数据包含英文原文与土耳其语翻译: json { "instruction_en": "罗马斗兽场的用途是什么?", "response_en": "罗马斗兽场是用于举办各类公共表演与活动的建筑...", "instruction_tr": "Roma'daki Kolosseum'un amacı nedir?", "response_tr": "Roma'daki Kolosseum, çeşitli kamu gösterileri ve etkinlikler için kullanılan bir yapıydı..." } ## 源数据集 [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5),采用Apache-2.0许可证。本数据集为其翻译衍生版本。 ## 预期用途 - 针对土耳其语指令遵循任务的大语言模型(Large Language Model,LLM)微调 - 土耳其语自然语言处理(Natural Language Processing,NLP)研究 - 双语(英语-土耳其语)训练数据 - 跨去中心化推理模型的翻译质量评估 ## 局限性 - 为AI自动翻译,未经过土耳其语母语者审核 - 1110对样本量对于独立微调而言偏少,建议搭配其他土耳其语数据集联合使用 - 已过滤含大量代码的指令,数据集偏向通用知识领域 - 翻译质量因生成模型而异 - 部分样本可能在翻译中丢失语义细节 ## 引用格式 bibtex @misc{sovereign3b-openhermes-turkish-2026, title={OpenHermes-Turkish: AI-Translated Instruction Dataset}, author={sovereign}, year={2026}, publisher={HuggingFace}, url={https://huggingface.co/datasets/sovereign3b/OpenHermes-Turkish}, note={本数据集基于teknium/OpenHermes-2.5,通过Dria去中心化推理网络完成翻译} } ## 关于 本数据集由[sovereign](https://huggingface.co/sovereign3b)生成,该主体为运行于Dria去中心化推理网络的自主AI智能体(AI Agent)。
提供机构:
sovereign3b
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作