Snaseem2026/synthetic-multilingual-instructions

Name: Snaseem2026/synthetic-multilingual-instructions
Creator: Snaseem2026
Published: 2026-01-14 11:27:09
License: apache-2.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated 2026-01-14; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Snaseem2026/synthetic-multilingual-instructions

Resource description:

---
dataset_name: synthetic-multilingual-instructions
dataset_description: |
  A massive, copyright-free dataset of synthetic instruction–response pairs in
  English, French, German, Spanish, Italian, and Arabic, generated using
  open-source LLMs and translation models. Suitable for training and evaluating
  large language models, chatbots, and multilingual systems.
features:
  - name: instruction
    dtype: string
  - name: response
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: complexity
    dtype: string
tags:
  - text
  - multilingual
  - synthetic-data
  - instruction-following
  - jsonl
  - open-source
  - language-models
  - translation
  - ai
  - dataset
language:
  - en
  - fr
  - de
  - es
  - it
  - ar
formats:
  - jsonl
size_categories:
  - 1K<n<10K
license: apache-2.0
task_categories:
  - text-classification
library:
  - datasets
  - pandas
  - polars
---

# Synthetic Multilingual Instruction Dataset

This dataset contains millions of synthetic, copyright-free instruction–response pairs covering practical, everyday scenarios.

Each record includes:

- `instruction`: The user prompt or question
- `response`: The synthetic answer
- `language`: Language code (e.g., 'en', 'fr', 'de', 'es', 'it', 'ar')
- `topic`: General topic (e.g., 'home repair', 'finance')
- `complexity`: One of 'basic', 'intermediate', 'advanced'

## Available Files

- `instructions_en.jsonl`: English (1M records)
- `instructions_fr.jsonl`: French (1000 records)
- `instructions_de.jsonl`: German (1420 records)
- `instructions_es.jsonl`: Spanish (1375 records)
- `instructions_it.jsonl`: Italian (1069 records)
- `instructions_ar.jsonl`: Arabic (1184 records)

## Usage Examples

**Load the English dataset:**

```python
from datasets import load_dataset

# Passing data_files selects a single language file; it is exposed as the 'train' split.
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_en.jsonl')
print(ds['train'][0])
```

**Load the French dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_fr.jsonl')
print(ds['train'][0])
```

**Load the German dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_de.jsonl')
print(ds['train'][0])
```

**Load the Spanish dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_es.jsonl')
print(ds['train'][0])
```

**Load the Italian dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_it.jsonl')
print(ds['train'][0])
```

**Load the Arabic dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_ar.jsonl')
print(ds['train'][0])
```

## Citation

If you use this dataset, please cite:

```
@misc{snaseem2026_synthetic_multilingual_instructions,
  title={Synthetic Multilingual Instruction Dataset},
  author={Snaseem2026},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/Snaseem2026/synthetic-multilingual-instructions}}
}
```
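
The record layout described in the dataset card can be checked locally before training on it. The following is a minimal sketch, assuming each line of a `.jsonl` file is one JSON object carrying the five documented fields. The two sample records and the helper names `parse_jsonl` and `filter_records` are hypothetical, written for illustration only, and are not taken from the dataset.

```python
import json

# Field names and allowed values as listed in the dataset card.
REQUIRED_FIELDS = ("instruction", "response", "language", "topic", "complexity")
LANGUAGES = {"en", "fr", "de", "es", "it", "ar"}
COMPLEXITIES = {"basic", "intermediate", "advanced"}

# Hypothetical sample lines for illustration; NOT real records from the dataset.
SAMPLE_JSONL = """\
{"instruction": "How do I fix a leaky tap?", "response": "Turn off the water, then replace the washer.", "language": "en", "topic": "home repair", "complexity": "basic"}
{"instruction": "Comment faire un budget mensuel ?", "response": "Listez vos revenus, puis vos charges fixes.", "language": "fr", "topic": "finance", "complexity": "intermediate"}
"""


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, validating each record."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if record["language"] not in LANGUAGES:
            raise ValueError(f"line {line_no}: unexpected language {record['language']!r}")
        if record["complexity"] not in COMPLEXITIES:
            raise ValueError(f"line {line_no}: unexpected complexity {record['complexity']!r}")
        records.append(record)
    return records


def filter_records(records, language=None, complexity=None):
    """Keep records matching the given language and/or complexity (None matches all)."""
    return [
        r for r in records
        if (language is None or r["language"] == language)
        and (complexity is None or r["complexity"] == complexity)
    ]


records = parse_jsonl(SAMPLE_JSONL)
french = filter_records(records, language="fr")
print(len(records), [r["topic"] for r in french])
```

To run the same check on a downloaded file, read it as UTF-8 text and pass the contents to `parse_jsonl`, for example `parse_jsonl(open('instructions_fr.jsonl', encoding='utf-8').read())`. Raising on the first malformed line is a deliberate choice for this sketch; a real pipeline may prefer to log and skip bad records.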

Provider:

Snaseem2026
