cjvt/GaMS-Nemotron-Chat
Source: Hugging Face. Updated: 2026-01-18. Indexed: 2026-02-07.
Download link:
https://hf-mirror.com/datasets/cjvt/GaMS-Nemotron-Chat
Summary:
**GaMS-Nemotron-Chat** is a conversational dataset containing approximately 98,000 examples, designed to improve the instruction-following and conversational capabilities of Slovene language models. It is derived from the [Nemotron Post Training Dataset v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1), featuring responses generated by the *Qwen3 235B A22B* model. The examples are translated into Slovene using the **GaMS-27B Instruct** [model](https://huggingface.co/cjvt/GaMS-27B-Instruct).
The dataset follows an **80:20 split** between translated Slovene examples and original English examples to maintain multilingual capabilities and prevent language degradation.
- **Total Examples:** ~98,000
- **Languages:** Slovene (80%), English (20%)
- **Source:** Subset of LMSYS Chat 1M (via Nvidia Nemotron)
- **Filtering:** Filtered using MinHash LSH and length-ratio checks.
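The filtering step above names two techniques without giving details, so the following is only a minimal, standard-library sketch of what a length-ratio check and MinHash LSH near-duplicate removal can look like. It is not the authors' pipeline. The thresholds (`low`, `high`), the signature length (`NUM_PERM`), the band count (`BANDS`), the word-trigram shingling, and all function names are assumptions chosen for illustration. The sketch also assumes the length ratio compares a translation against its English source, which the card does not state.

```python
import hashlib

NUM_PERM = 64  # assumed signature length (not from the dataset card)
BANDS = 16     # assumed band count: 16 bands of 4 rows each


def length_ratio_ok(source, translation, low=0.5, high=2.0):
    """Keep a pair only if the character-length ratio is plausible.

    The bounds `low` and `high` are hypothetical defaults.
    """
    if not source or not translation:
        return False
    ratio = len(translation) / len(source)
    return low <= ratio <= high


def shingles(text, n=3):
    """Return the set of lowercase word n-grams of the text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text):
    """Compute a MinHash signature using NUM_PERM salted BLAKE2b hashes."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return tuple(signature)


def drop_near_duplicates(texts):
    """Return indices of texts kept after LSH banding.

    Any text sharing at least one full band with an earlier kept text
    is treated as a near-duplicate and dropped (no exact verification).
    """
    rows = NUM_PERM // BANDS
    seen_bands = set()
    kept = []
    for idx, text in enumerate(texts):
        sig = minhash(text)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        if any(key in seen_bands for key in keys):
            continue
        seen_bands.update(keys)
        kept.append(idx)
    return kept
```

A production pipeline would more likely rely on a dedicated library and verify candidate pairs with an explicit Jaccard similarity threshold. The banding here only illustrates why LSH avoids comparing every pair of examples.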
Provider:
cjvt



