
V4ldeLund/scandi-translated-instruct

Hugging Face | Updated 2026-01-10 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/V4ldeLund/scandi-translated-instruct
Description:
---
pretty_name: scandi-translated-instruct
language:
- da
- sv
- nn
- nb
tags:
- instruction-tuning
- chat
- conversational
- machine-translation
- nordic
- multilingual
task_categories:
- text-generation
size_categories:
- 1M<n<10M
license: other
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: language
    dtype: string
  - name: model
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: source
    dtype: string
---

# scandi-translate

A Scandinavian instruction-tuning dataset built from machine-translated instruction/response pairs. It unifies Danish, Swedish, Norwegian Bokmål and Norwegian Nynorsk data into a chat-friendly `messages` schema.

## Dataset size

- Total rows: **1,252,683**
- Per-language counts measured during build:
  - Danish (`da`): **377,413**
  - Swedish (`sv`): **332,558**
  - Norwegian Nynorsk (`nn`): **270,772**
  - Norwegian Bokmål (`nb`): **268,875**
- By source:
  - `V4ldeLund/da-translated-instruct`
  - `akoksal/muri-it-language-split` (configs `swe`, `nor`)
  - `CohereLabs/aya_collection_language_split` (configs `swedish`, `norwegian_bokmal`, `norwegian_nynorsk`)
  - `neph1/Alpaca-Lora-GPT4-Swedish-Refined` (train split)

## Licences

| Source dataset | License note (as of 2026-01-10) |
| --- | --- |
| `akoksal/muri-it-language-split` (swe/nor) | **Apache-2.0** |
| `CohereLabs/aya_collection_language_split` | **Apache-2.0** |
| `neph1/Alpaca-Lora-GPT4-Swedish-Refined` | License not listed on HF page |
| `V4ldeLund/da-translated-instruct` (contains `akoksal/muri-it-language-split` dan, `Mabeck/danish-OpenHermes`, `CohereLabs/aya_collection_language_split` danish) | Mixed Apache-2.0 / MIT |

Because of the mixture and unspecified items, the combined release is marked **“other”**. Every row keeps its `source` so downstream users can honor upstream terms.
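
Since every row carries its `source`, one way a downstream user might honor upstream terms is to keep only rows from sources whose licences they have reviewed. The following is a minimal sketch under stated assumptions: rows are plain dicts, the `source` value matches the upstream repo id exactly, and both the helper `filter_by_source` and the contents of `ALLOWED_SOURCES` are hypothetical illustrations, not part of the dataset's tooling.

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical allowlist: the two sources the licence table lists as Apache-2.0.
# Review the upstream licences yourself before relying on any such list.
ALLOWED_SOURCES = {
    "akoksal/muri-it-language-split",
    "CohereLabs/aya_collection_language_split",
}


def filter_by_source(rows: Iterable[dict], allowed: set[str]) -> Iterator[dict]:
    """Yield only the rows whose `source` is in the allowlist."""
    for row in rows:
        if row.get("source") in allowed:
            yield row


# Tiny in-memory stand-in for dataset rows (illustrative values only).
sample_rows = [
    {"language": "sv", "source": "akoksal/muri-it-language-split", "messages": []},
    {"language": "sv", "source": "neph1/Alpaca-Lora-GPT4-Swedish-Refined", "messages": []},
    {"language": "nn", "source": "CohereLabs/aya_collection_language_split", "messages": []},
]

kept = list(filter_by_source(sample_rows, ALLOWED_SOURCES))
kept_per_language = Counter(row["language"] for row in kept)
```

An allowlist is used here rather than a blocklist so that rows from an unreviewed or unexpected source are dropped by default.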

## Languages included

- Danish (`da`)
- Swedish (`sv`)
- Norwegian Nynorsk (`nn`)
- Norwegian Bokmål (`nb`)

## Format

Columns

- `language` (string): ISO 639-1 two-letter language code (`da`, `sv`, `nn`, `nb`).
- `model` (string): identifier of the translation model used when generating the pair.
- `source` (string): upstream dataset repo id.
- `messages` (list): chat messages, each with
  - `role`: one of `system`, `user`, `assistant`
  - `content`: message text

Example

```json
{
  "model": "Mixtral-8x7B",
  "source": "akoksal/muri-it-language-split",
  "language": "nb",
  "messages": [
    {"role": "user", "content": "Forklar kort fotosyntesen."},
    {"role": "assistant", "content": "Fotosyntese er processen..."}
  ]
}
```
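
The column description above can be turned into a quick sanity check before training. This sketch assumes rows arrive as plain dicts shaped like the example; the helpers `validate_row` and `user_assistant_pairs` are hypothetical names introduced here, and the checks cover only what this card states about the schema.

```python
from typing import Any

VALID_LANGUAGES = {"da", "sv", "nn", "nb"}
VALID_ROLES = {"system", "user", "assistant"}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the row looks valid."""
    problems = []
    for column in ("language", "model", "source"):
        if not isinstance(row.get(column), str):
            problems.append(f"{column} must be a string")
    language = row.get("language")
    if isinstance(language, str) and language not in VALID_LANGUAGES:
        problems.append(f"unexpected language code: {language}")
    messages = row.get("messages")
    if not isinstance(messages, list):
        problems.append("messages must be a list")
        return problems
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        if message.get("role") not in VALID_ROLES:
            problems.append(f"messages[{i}] has unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{i}].content must be a string")
    return problems


def user_assistant_pairs(messages: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Pair each user message with the assistant message that directly follows it."""
    pairs = []
    for current, following in zip(messages, messages[1:]):
        if current["role"] == "user" and following["role"] == "assistant":
            pairs.append((current["content"], following["content"]))
    return pairs


# The example row from this card, as a Python dict.
example_row = {
    "model": "Mixtral-8x7B",
    "source": "akoksal/muri-it-language-split",
    "language": "nb",
    "messages": [
        {"role": "user", "content": "Forklar kort fotosyntesen."},
        {"role": "assistant", "content": "Fotosyntese er processen..."},
    ],
}

problems = validate_row(example_row)
pairs = user_assistant_pairs(example_row["messages"])
```

If the Hugging Face `datasets` library is installed, the rows could presumably be obtained with `load_dataset("V4ldeLund/scandi-translated-instruct", split="train")`, matching the single `train` split declared in the card metadata, and passed to these helpers one at a time.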
Provided by:
V4ldeLund