
locailabs/nemotron-chat-welsh

Hugging Face, updated 2026-04-12, indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/locailabs/nemotron-chat-welsh
Resource description:
---
language:
- cy
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- welsh
- cymraeg
- nemotron
- synthetic
- translation
- sft
---

# Nemotron Instruction Following Chat — Welsh (Cymraeg)

Welsh-language supervised fine-tuning dataset translated from the NVIDIA Nemotron Instruction Following Chat dataset using an LLM translation pipeline.

## Dataset summary

| Split | Count | Description |
|-------|-------|-------------|
| `train` | 27807 | Welsh translations of English chat instruction-following examples |

## How this dataset was made

### 1. Source data

Examples were drawn from `nvidia/Nemotron-Instruction-Following-Chat-v1` (`chat_if` split, filtered to `capability_target == chat`), yielding 336,831 candidate rows.

### 2. Preprocessing

Each row was preprocessed as follows:

1. **Conversation flattening**: multi-turn messages stripped of system prompts and truncated to the first user–assistant turn, producing `{prompt, response}` pairs.
2. **Content filtering**: rows containing model self-identification strings (e.g. "nemotron") removed (64,157 rows filtered).
3. **Language detection**: rows where either the prompt or response was detected as non-English were removed using `lingua-language-detector` (12,924 rows filtered), leaving **259,750 rows**.

### 3. Translation

Each prompt and response was independently translated from English to Welsh using `qwen/qwen3.5-35b-a3b` via OpenRouter (non-thinking/instruct mode). The pipeline uses async calls (50 concurrent) with per-row retry logic and incremental JSONL output for resume support. Translation prompt instructs the model to preserve XML tags, URLs, mathematical formulas, and code blocks verbatim.

### 4. Format

Each row contains a `messages` list in standard chat format:

```json
{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

## Intended use

Post-training / fine-tuning LLMs to add Welsh language capability.
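The flattening and content-filtering steps listed under "Preprocessing" could look roughly like the sketch below. It is an illustration, not the authors' code: the function names, the `SELF_ID_STRINGS` blocklist (the card only gives "nemotron" as an example), and the choice to take the first adjacent user/assistant pair are all assumptions. The language-detection step is left out so the example runs without third-party packages.

```python
from typing import Optional

# Hypothetical blocklist: the card names only "nemotron" as an example.
SELF_ID_STRINGS = ("nemotron",)


def flatten(messages: list[dict]) -> Optional[dict]:
    """Drop system prompts and keep only the first user-assistant turn."""
    turns = [m for m in messages if m.get("role") != "system"]
    for first, second in zip(turns, turns[1:]):
        if first.get("role") == "user" and second.get("role") == "assistant":
            return {"prompt": first["content"], "response": second["content"]}
    return None  # no complete user-assistant turn in this conversation


def mentions_self(pair: dict) -> bool:
    """True if the prompt or response contains a self-identification string."""
    text = (pair["prompt"] + " " + pair["response"]).lower()
    return any(s in text for s in SELF_ID_STRINGS)


def preprocess(rows: list[list[dict]]) -> list[dict]:
    """Flatten every conversation, then drop rows that name the model."""
    pairs = (flatten(r) for r in rows)
    return [p for p in pairs if p is not None and not mentions_self(p)]


def to_messages(pair: dict) -> dict:
    """Convert a {prompt, response} pair into the chat format shown above."""
    return {
        "messages": [
            {"role": "user", "content": pair["prompt"]},
            {"role": "assistant", "content": pair["response"]},
        ]
    }
```

Matching on lower-cased text keeps the filter case-insensitive. A real run would likely need a longer blocklist than the single example string.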
## Source dataset

`nvidia/Nemotron-Instruction-Following-Chat-v1`
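The translation step describes async calls with a concurrency cap, per-row retries, and incremental JSONL output that allows resuming. A minimal sketch of that pattern follows, using only the standard library. The `translate` callable stands in for the actual OpenRouter request, which the card does not show. The `id` field, the retry count, the backoff values, and the reading of "50 concurrent" as 50 in-flight calls are assumptions made for illustration.

```python
import asyncio
import json
from pathlib import Path
from typing import Awaitable, Callable

# Placeholder type for whatever coroutine performs the real API request.
Translate = Callable[[str], Awaitable[str]]


def load_done(path: Path) -> set[int]:
    """Row ids already in the JSONL output, so a rerun can skip them."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


async def with_retry(translate: Translate, text: str, attempts: int = 3) -> str:
    """Call translate, retrying with a short exponential backoff."""
    for attempt in range(attempts):
        try:
            return await translate(text)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(0.01 * 2 ** attempt)  # assumed backoff
    raise RuntimeError("attempts must be at least 1")


async def run(rows: list[dict], translate: Translate, out_path: Path,
              concurrency: int = 50) -> list[int]:
    """Translate pending rows, append each result to out_path, return failed ids."""
    done = load_done(out_path)
    sem = asyncio.Semaphore(concurrency)  # caps the number of in-flight calls

    async def call(text: str) -> str:
        async with sem:
            return await with_retry(translate, text)

    async def handle(row: dict):
        try:
            # Prompt and response are translated independently.
            prompt, response = await asyncio.gather(
                call(row["prompt"]), call(row["response"])
            )
        except Exception:
            return row["id"]  # left out of the output; a rerun will retry it
        record = {"id": row["id"], "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]}
        # Synchronous append: no await here, so coroutines cannot interleave writes.
        with out_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
        return None

    results = await asyncio.gather(
        *(handle(r) for r in rows if r["id"] not in done)
    )
    return [r for r in results if r is not None]
```

Because each finished row is appended immediately, an interrupted run loses only the rows that were still in flight. Calling `run` again with the same output path skips every id already on disk.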
Provider:
locailabs