five

hermes3-uk

收藏
魔搭社区2025-12-05 更新2025-11-08 收录
下载链接:
https://modelscope.cn/datasets/lapa-llm/hermes3-uk
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for Hermes 3 Ukrainian Fixed Conversations ## Dataset Description **Dataset Summary** **hermes3-uk-fixed** is a Ukrainian translation of the **[NousResearch/Hermes-3-Dataset]**. The translation was produced with *Gemma 3 27B (instruction-tuned)*. During preparation we **removed all system prompts** and **normalized the message roles and content** to match the common schema we use across our dialog datasets. **Languages** - Ukrainian (uk) ## Dataset Structure **Data Fields** - `conversations`: list of messages in a dialog (array of objects) - `from`: normalized sender role — **`user`** or **`assistant`** (system messages are removed) - `value`: message text **Splits** - `train`: full set of dialogs (count matches the source dataset after filtering/merging). ## Dataset Creation **Source Data** - Base dataset: *NousResearch/Hermes-3-Dataset* (Apache-2.0). - Translation: inference with *google/gemma-3-27b-it*. **Processing / “Fixed” Changes** 1. **Removed `system`**: dropped all system-level messages from `conversations`. 2. **Role normalization**: mapped `human` → `user`, `gpt` → `assistant`; collapsed any aliases to these two roles. **Intended Uses** - Instruction/chat LLM training in Ukrainian - Continued pretraining or SFT on multi-turn dialogs - Research on adapting Hermes-style dialogs to Ukrainian ## Considerations for Using the Data **Social Impact** Aims to strengthen the Ukrainian-language LLM ecosystem and improve accessibility of language technology for Ukrainian speakers. **Bias & Limitations** - Hermes-3 includes neutral and roleplay/creative dialogs — stylistic bias may be present. - Machine translation can introduce subtle errors or tone shifts. - Removing `system` prompts may drop context that affected certain responses. **Recommendations** - Suitable for general dialog fine-tuning and instruction following. - For specialized domains, apply additional filtering and QA. ## How to Use ```python from datasets import load_dataset ds = load_dataset("le-llm/hermes3-uk-fixed", split="train") print(ds[0]["conversations"][:2]) # [{'from': 'user', 'value': '...'}, {'from': 'assistant', 'value': '...'}] ``` ## Citation When using this dataset, please cite the original sources: - Nous Research. *Hermes-3-Dataset* (Apache-2.0), Hugging Face Datasets. - Google DeepMind. *Gemma 3 27B (IT)*, Hugging Face Models / Gemma 3 documentation. **BibTeX** TBD, pleliminary below ```bibtex @dataset{hermes3_uk_fixed_2025, title = {Hermes 3 Ukrainian Fixed Conversations}, author = {le-llm}, year = {2025}, url = {https://huggingface.co/datasets/lapa-llm/hermes3-uk-fixed} } @misc{nous_hermes3_dataset, title = {Hermes-3-Dataset}, author = {Nous Research}, howpublished = {Hugging Face Datasets}, year = {2024}, url = {https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset} } @misc{gemma3_27b_it_2025, title = {Gemma 3 27B (instruction-tuned)}, author = {Google DeepMind}, howpublished = {Hugging Face Models}, year = {2025}, url = {https://huggingface.co/google/gemma-3-27b-it} } ``` ## License CC-BY-SA-4.0 --- *This dataset is part of the "Lapa" - Ukrainian LLM initiative to advance natural language processing for the Ukrainian language.*

# Hermes 3 乌克兰语修正对话数据集卡片 ## 数据集描述 ### 数据集概览 **hermes3-uk-fixed** 是 **[NousResearch/Hermes-3-Dataset]** 的乌克兰语译本,该翻译由 *Gemma 3 27B(指令微调版)* 生成。在数据预处理阶段,我们**移除了所有系统提示词**,并**标准化了消息角色与内容**,以匹配我们在各类对话数据集中通用的schema。 ### 语言 - 乌克兰语(uk) ## 数据集结构 ### 数据字段 - `conversations`:对话消息列表(对象数组) - `from`:标准化的发送者角色,仅支持 **`user`(用户)** 或 **`assistant`(助手)**(系统消息已被移除) - `value`:消息文本内容 ### 数据集划分 - `train`:完整对话集(经过滤与合并后,样本数量与源数据集一致) ## 数据集构建 ### 源数据 - 基础数据集:*NousResearch/Hermes-3-Dataset*(协议:Apache-2.0)。 - 翻译流程:通过 *google/gemma-3-27b-it* 模型进行推理生成。 ### 预处理/“修正”调整 1. **移除系统消息**:删除`conversations`中的所有系统级消息。 2. **角色标准化**:将`human`映射为`user`,`gpt`映射为`assistant`;将所有别名统一归为这两类角色。 ## 预期用途 - 乌克兰语指令/对话大语言模型(LLM)训练 - 多轮对话的持续预训练或监督微调(SFT,Supervised Fine-Tuning) - 研究将Hermes风格对话适配至乌克兰语场景 ## 数据使用注意事项 ### 社会影响 旨在强化乌克兰语大语言模型生态,提升乌克兰使用者对语言技术的可及性。 ### 偏差与局限性 - Hermes-3 包含中性、角色扮演/创意类对话,可能存在风格偏差。 - 机器翻译可能引入细微错误或语调偏移。 - 移除系统提示词可能丢失影响部分回复的上下文信息。 ### 使用建议 - 适用于通用对话微调与指令遵循任务。 - 针对专业领域场景,需额外进行过滤与质量检查(QA)。 ## 使用方法 python from datasets import load_dataset ds = load_dataset("le-llm/hermes3-uk-fixed", split="train") print(ds[0]["conversations"][:2]) # [{"from": "user", "value": "..."}, {"from": "assistant", "value": "..."}] ## 引用说明 使用本数据集时,请引用以下原始来源: - Nous Research. *Hermes-3-Dataset*(Apache-2.0协议),Hugging Face 数据集。 - Google DeepMind. *Gemma 3 27B(IT版)*,Hugging Face 模型 / Gemma 3 文档。 ### BibTeX 暂未正式定稿,以下为初步格式: bibtex @dataset{hermes3_uk_fixed_2025, title = {Hermes 3 Ukrainian Fixed Conversations}, author = {le-llm}, year = {2025}, url = {https://huggingface.co/datasets/lapa-llm/hermes3-uk-fixed} } @misc{nous_hermes3_dataset, title = {Hermes-3-Dataset}, author = {Nous Research}, howpublished = {Hugging Face Datasets}, year = {2024}, url = {https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset} } @misc{gemma3_27b_it_2025, title = {Gemma 3 27B (instruction-tuned)}, author = {Google DeepMind}, howpublished = {Hugging Face Models}, year = {2025}, url = {https://huggingface.co/google/gemma-3-27b-it} } ## 许可证 CC-BY-SA-4.0 --- *本数据集隶属于“Lapa”项目——一项旨在推进乌克兰语自然语言处理发展的乌克兰语大语言模型倡议。*
提供机构:
maas
创建时间:
2025-10-28
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作