cjvt/GaMS-Nemotron-Chat

Name: cjvt/GaMS-Nemotron-Chat
Creator: cjvt
Published: 2026-01-18 22:53:54
License: No description available

Hugging Face | Updated 2026-01-18 | Indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/cjvt/GaMS-Nemotron-Chat

Overview:

**GaMS-Nemotron-Chat** is a conversational dataset containing approximately 98,000 examples, designed to improve the instruction-following and conversational capabilities of Slovene language models. It is derived from the [Nemotron Post Training Dataset v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1), featuring responses generated by the *Qwen3 235B A22B* model. The examples are translated into Slovene using the **GaMS-27B Instruct** [model](https://huggingface.co/cjvt/GaMS-27B-Instruct). The dataset follows an **80:20 split** between translated Slovene examples and original English examples to maintain multilingual capabilities and prevent language degradation.

- **Total Examples:** ~98,000
- **Languages:** Slovene (80%), English (20%)
- **Source:** Subset of LMSYS Chat 1M (via Nvidia Nemotron)
- **Filtering:** Filtered using MinHash LSH and length-ratio checks.
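The filtering step is only named in the description, not specified. As a rough illustration of what MinHash LSH deduplication and a length-ratio check on translations can look like, here is a minimal standard-library sketch. Everything in it is an assumption made for illustration: the function names, the word 3-gram shingles, the 64 hash functions, the 16 bands, and the 0.5 to 2.0 ratio bounds are hypothetical and are not taken from the dataset authors' pipeline.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of word k-grams in `text` (k=3 is an assumed choice)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= k:
        # Short texts collapse to a single shingle so the set is never empty.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=64, k=3):
    """MinHash signature: the minimum of each seeded hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature, bands=16):
    """Split a signature into bands; texts sharing any key are candidate duplicates."""
    rows = len(signature) // bands
    return {
        (band, tuple(signature[band * rows:(band + 1) * rows]))
        for band in range(bands)
    }


def length_ratio_ok(source, translation, low=0.5, high=2.0):
    """Accept a translation only if its character length is plausible.

    The bounds are assumed values, not the ones used for this dataset.
    """
    if not source or not translation:
        return False
    ratio = len(translation) / len(source)
    return low <= ratio <= high
```

In a pipeline of this kind, examples whose band keys collide would be compared with `estimated_jaccard` and dropped above a chosen similarity threshold, while `length_ratio_ok` would flag translations that are suspiciously short (truncated) or long (runaway generation) relative to the English source.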

Provider:

cjvt
