locailabs/nemotron-chat-welsh

Name: locailabs/nemotron-chat-welsh
Creator: locailabs
Published: 2026-04-12 07:57:24
License: cc-by-4.0

Hugging Face · updated 2026-04-12 · indexed 2026-05-10

Download link:

https://hf-mirror.com/datasets/locailabs/nemotron-chat-welsh

Resource description:

---
language:
- cy
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- welsh
- cymraeg
- nemotron
- synthetic
- translation
- sft
---

# Nemotron Instruction Following Chat: Welsh (Cymraeg)

Welsh-language supervised fine-tuning dataset translated from the NVIDIA Nemotron Instruction Following Chat dataset using an LLM translation pipeline.

## Dataset summary

| Split | Count | Description |
|-------|-------|-------------|
| `train` | 27807 | Welsh translations of English chat instruction-following examples |

## How this dataset was made

### 1. Source data

Examples were drawn from `nvidia/Nemotron-Instruction-Following-Chat-v1` (`chat_if` split, filtered to `capability_target == chat`), yielding 336,831 candidate rows.

### 2. Preprocessing

Each row was preprocessed as follows:

1. **Conversation flattening**: multi-turn messages were stripped of system prompts and truncated to the first user–assistant turn, producing `{prompt, response}` pairs.
2. **Content filtering**: rows containing model self-identification strings (e.g. "nemotron") were removed (64,157 rows filtered).
3. **Language detection**: rows where either the prompt or response was detected as non-English were removed using `lingua-language-detector` (12,924 rows filtered), leaving **259,750 rows**.

### 3. Translation

Each prompt and response was independently translated from English to Welsh using `qwen/qwen3.5-35b-a3b` via OpenRouter (non-thinking/instruct mode). The pipeline uses async calls (50 concurrent) with per-row retry logic and incremental JSONL output for resume support. The translation prompt instructs the model to preserve XML tags, URLs, mathematical formulas, and code blocks verbatim.

### 4. Format

Each row contains a `messages` list in standard chat format:

```json
{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

## Intended use

Post-training / fine-tuning LLMs to add Welsh language capability.

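The preprocessing and format steps described under "How this dataset was made" can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names and the `SELF_ID_MARKERS` tuple are hypothetical, the marker list contains only the one example the card gives ("nemotron"), and the language-detection step is left out because it depends on an external library.

```python
from typing import Optional

# Hypothetical marker list: the card names only "nemotron" as an example.
SELF_ID_MARKERS = ("nemotron",)


def flatten_conversation(messages: list[dict]) -> Optional[dict]:
    """Drop system prompts and keep only the first user-assistant turn."""
    turns = [m for m in messages if m.get("role") != "system"]
    for first, second in zip(turns, turns[1:]):
        if first.get("role") == "user" and second.get("role") == "assistant":
            return {"prompt": first["content"], "response": second["content"]}
    return None  # no complete user-assistant turn in this conversation


def mentions_model_identity(pair: dict) -> bool:
    """True if the prompt or response contains a self-identification marker."""
    text = (pair["prompt"] + "\n" + pair["response"]).lower()
    return any(marker in text for marker in SELF_ID_MARKERS)


def to_chat_row(pair: dict) -> dict:
    """Wrap a {prompt, response} pair in the `messages` chat format."""
    return {
        "messages": [
            {"role": "user", "content": pair["prompt"]},
            {"role": "assistant", "content": pair["response"]},
        ]
    }


def preprocess(rows: list[list[dict]]) -> list[dict]:
    """Flatten each conversation and drop pairs that mention the model name."""
    kept = []
    for messages in rows:
        pair = flatten_conversation(messages)
        if pair is not None and not mentions_model_identity(pair):
            kept.append(pair)
    return kept
```

In the pipeline the card describes, translation happens between `preprocess` and `to_chat_row`, so the final `messages` content is Welsh rather than English.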
## Source dataset

`nvidia/Nemotron-Instruction-Following-Chat-v1`
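The translation step applied to the source rows (bounded concurrency, per-row retries, and incremental JSONL output that allows resuming) might look something like the sketch below. It is an assumption-laden illustration, not the actual pipeline: `translate` is a caller-supplied coroutine standing in for the real OpenRouter call, the per-row `id` field is hypothetical, and the retry count and backoff values are arbitrary.

```python
import asyncio
import json
from pathlib import Path
from typing import Awaitable, Callable

# Stand-in type for the real model call (assumption: text in, text out).
Translate = Callable[[str], Awaitable[str]]


def load_done_ids(out_path: Path) -> set:
    """Collect ids already written so that a rerun can skip them."""
    if not out_path.exists():
        return set()
    with out_path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


async def with_retry(translate: Translate, text: str, attempts: int = 3) -> str:
    """Call `translate`, retrying with a short exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return await translate(text)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            await asyncio.sleep(0.01 * 2 ** attempt)


async def translate_rows(rows: list, translate: Translate,
                         out_path: Path, concurrency: int = 50) -> int:
    """Translate rows not yet in `out_path`; return how many were processed."""
    done = load_done_ids(out_path)
    sem = asyncio.Semaphore(concurrency)  # caps the number of in-flight rows

    async def worker(row: dict) -> None:
        async with sem:
            # Prompt and response are translated independently.
            prompt = await with_retry(translate, row["prompt"])
            response = await with_retry(translate, row["response"])
        record = {"id": row["id"], "prompt": prompt, "response": response}
        # Append as soon as a row finishes so an interrupted run can resume.
        with out_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    pending = [worker(r) for r in rows if r["id"] not in done]
    await asyncio.gather(*pending)
    return len(pending)
```

Appending one JSON line per finished row means a crash loses only the rows that were in flight; rerunning with the same output path skips everything already written.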
