locailabs/nemotron-chat-welsh
Source: Hugging Face | Updated: 2026-04-12 | Indexed: 2026-05-10
Download link:
https://hf-mirror.com/datasets/locailabs/nemotron-chat-welsh
Resource description:
---
language:
- cy
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- welsh
- cymraeg
- nemotron
- synthetic
- translation
- sft
---
# Nemotron Instruction Following Chat — Welsh (Cymraeg)
A Welsh-language supervised fine-tuning (SFT) dataset, machine-translated from the
NVIDIA Nemotron Instruction Following Chat dataset using an LLM translation pipeline.
## Dataset summary
| Split | Count | Description |
|-------|-------|-------------|
| `train` | 27,807 | Welsh translations of English chat instruction-following examples |
## How this dataset was made
### 1. Source data
Examples were drawn from `nvidia/Nemotron-Instruction-Following-Chat-v1`
(`chat_if` split, filtered to `capability_target == chat`), yielding 336,831
candidate rows.
### 2. Preprocessing
Each row was preprocessed as follows:
1. **Conversation flattening** — multi-turn messages stripped of system prompts
and truncated to the first user–assistant turn, producing `{prompt, response}`
pairs.
2. **Content filtering** — rows containing model self-identification strings
(e.g. "nemotron") removed (64,157 rows filtered).
3. **Language detection** — rows where either the prompt or response was detected
as non-English were removed using `lingua-language-detector` (12,924 rows
filtered), leaving **259,750 rows**.
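The three steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the pipeline's actual code: the function names, the `SELF_ID_STRINGS` tuple (only "nemotron" is named in this card), and the injected `is_english` callable are all assumptions. In the real pipeline the language check was done with `lingua-language-detector`; here it is passed in as a plain function so the sketch stays self-contained.

```python
from typing import Callable, Iterable, Iterator, Optional

# Assumption: the card only names "nemotron" as an example; the full
# list of self-identification strings is not published.
SELF_ID_STRINGS = ("nemotron",)


def flatten(messages: list) -> Optional[dict]:
    """Drop system prompts and keep only the first user-assistant turn."""
    turns = [m for m in messages if m.get("role") != "system"]
    if len(turns) < 2:
        return None
    first, second = turns[0], turns[1]
    if first.get("role") != "user" or second.get("role") != "assistant":
        return None
    return {"prompt": first["content"], "response": second["content"]}


def mentions_self_id(pair: dict) -> bool:
    """True if the prompt or response contains a self-identification string."""
    text = (pair["prompt"] + "\n" + pair["response"]).lower()
    return any(s in text for s in SELF_ID_STRINGS)


def preprocess(
    rows: Iterable[list], is_english: Callable[[str], bool]
) -> Iterator[dict]:
    """Yield {prompt, response} pairs that survive all three steps."""
    for messages in rows:
        pair = flatten(messages)
        if pair is None:
            continue  # no usable first turn
        if mentions_self_id(pair):
            continue  # content filter
        if not (is_english(pair["prompt"]) and is_english(pair["response"])):
            continue  # either side detected as non-English
        yield pair
```

With `lingua-language-detector` installed, `is_english` could wrap a detector built with `LanguageDetectorBuilder.from_all_languages().build()` and compare the result of `detect_language_of(text)` against `Language.ENGLISH`. Whether the original pipeline configured the detector this way is not stated.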
### 3. Translation
Each prompt and response was independently translated from English to Welsh using
`qwen/qwen3.5-35b-a3b` via OpenRouter (non-thinking/instruct mode). The pipeline
uses async calls (50 concurrent) with per-row retry logic and incremental JSONL
output for resume support.
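A minimal sketch of that pattern (bounded concurrency, per-call retry, and append-only JSONL that doubles as a resume log) might look like the following. The `translate` coroutine is a hypothetical stand-in for the OpenRouter request, and the `id` field, retry count, and backoff delays are assumptions rather than details taken from the pipeline.

```python
import asyncio
import json
import os
from typing import Awaitable, Callable

Translate = Callable[[str], Awaitable[str]]  # hypothetical API wrapper


def load_done_ids(path: str) -> set:
    """Read ids already written to the JSONL file so a rerun can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


async def with_retry(translate: Translate, text: str, attempts: int = 3) -> str:
    """Call translate, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await translate(text)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(0.01 * 2 ** attempt)  # placeholder delays
    raise ValueError("attempts must be >= 1")


async def run(rows, translate: Translate, out_path: str, concurrency: int = 50) -> int:
    """Translate pending rows; return the number of rows that failed."""
    done = load_done_ids(out_path)
    sem = asyncio.Semaphore(concurrency)  # caps in-flight API calls

    async def limited(text: str) -> str:
        async with sem:
            return await with_retry(translate, text)

    async def worker(row: dict, out) -> None:
        # Prompt and response are translated independently.
        prompt, response = await asyncio.gather(
            limited(row["prompt"]), limited(row["response"])
        )
        record = {"id": row["id"], "prompt": prompt, "response": response}
        # No await between write and flush, so lines never interleave.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        out.flush()

    with open(out_path, "a", encoding="utf-8") as out:
        pending = [worker(r, out) for r in rows if r["id"] not in done]
        results = await asyncio.gather(*pending, return_exceptions=True)
    return sum(1 for r in results if isinstance(r, Exception))
```

Because each finished row is flushed immediately, an interrupted run loses at most the rows that were in flight, and calling `run` again with the same `out_path` skips everything already written.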
The translation prompt instructs the model to preserve XML tags, URLs, mathematical
formulas, and code blocks verbatim.
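Anyone consuming the data may want to spot-check that instruction. The sketch below is not part of the described pipeline: it is a hypothetical post-hoc check that compares protected spans between a source text and its translation. The regular expressions are rough assumptions covering fenced code, inline code, URLs, and XML-style tags. Mathematical formulas are deliberately left out because they have no single reliable delimiter.

```python
import re
from collections import Counter

# Illustrative patterns only; the pipeline's own rules are not published.
PROTECTED = re.compile(
    r"```.*?```"                               # fenced code block
    r"|`[^`\n]+`"                              # inline code
    r"|https?://[^\s)>\]]+"                    # URL
    r"|</?[A-Za-z][\w:.-]*(?:\s[^<>]*)?/?>",   # XML-style tag
    re.DOTALL,
)


def protected_spans(text: str) -> Counter:
    """Count every protected span found in the text."""
    return Counter(PROTECTED.findall(text))


def missing_spans(source: str, translation: str) -> list:
    """Spans that occur in the source more often than in the translation."""
    diff = protected_spans(source) - protected_spans(translation)
    return sorted(diff.elements())
```

An empty list from `missing_spans` means every matched span survived unchanged. A non-empty list flags rows worth inspecting by hand, for example where a tag name was translated along with the surrounding prose.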
### 4. Format
Each row contains a `messages` list in standard chat format:
```json
{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```
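As a small illustration of this layout, the helpers below build a row from a translated `{prompt, response}` pair and check that a row has the expected shape. The function names are hypothetical, and the check assumes exactly one user message followed by one assistant message, which follows from the single-turn truncation described above but is not stated as a guarantee.

```python
import json


def to_chat_row(prompt: str, response: str) -> dict:
    """Wrap a translated pair in the chat-format structure shown above."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }


def is_valid_row(row: dict) -> bool:
    """Check for exactly one user turn followed by one assistant turn."""
    msgs = row.get("messages")
    if not isinstance(msgs, list) or len(msgs) != 2:
        return False
    if not all(isinstance(m, dict) for m in msgs):
        return False
    roles = [m.get("role") for m in msgs]
    contents_ok = all(isinstance(m.get("content"), str) for m in msgs)
    return roles == ["user", "assistant"] and contents_ok


# Round trip through JSON, as each row would appear in a JSONL file.
line = json.dumps(to_chat_row("Sut wyt ti?", "Da iawn, diolch."), ensure_ascii=False)
row = json.loads(line)
```

Passing `ensure_ascii=False` keeps Welsh characters such as ŵ and ŷ readable in the serialized file rather than escaping them.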
## Intended use
Post-training / fine-tuning LLMs to add Welsh language capability.
## Source dataset
`nvidia/Nemotron-Instruction-Following-Chat-v1`
Provider:
locailabs



