
Snaseem2026/synthetic-multilingual-instructions

Source: Hugging Face. Updated 2026-01-14. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Snaseem2026/synthetic-multilingual-instructions
Resource description:
---
dataset_name: synthetic-multilingual-instructions
dataset_description: |
  A massive, copyright-free dataset of synthetic instruction–response pairs in English, French, German, Spanish, Italian, and Arabic, generated using open-source LLMs and translation models. Suitable for training and evaluating large language models, chatbots, and multilingual systems.
features:
  - name: instruction
    dtype: string
  - name: response
    dtype: string
  - name: language
    dtype: string
  - name: topic
    dtype: string
  - name: complexity
    dtype: string
tags:
  - text
  - multilingual
  - synthetic-data
  - instruction-following
  - jsonl
  - open-source
  - language-models
  - translation
  - ai
  - dataset
language:
  - en
  - fr
  - de
  - es
  - it
  - ar
formats:
  - jsonl
size_categories:
  - 1K<n<10K
license: apache-2.0
task_categories:
  - text-classification
library:
  - datasets
  - pandas
  - polars
---

# Synthetic Multilingual Instruction Dataset

This dataset contains millions of synthetic, copyright-free instruction–response pairs covering practical, everyday scenarios.
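
The `features` and `language` entries in the metadata above suggest a simple sanity check for individual records. The following is a minimal sketch, assuming each record arrives as a plain Python `dict`. `validate_record` is a hypothetical helper written for this illustration, not part of the dataset or of any library, and the sample record is made up.

```python
# Field names and language codes are copied from the dataset metadata.
EXPECTED_FIELDS = ("instruction", "response", "language", "topic", "complexity")
LANGUAGE_CODES = {"en", "fr", "de", "es", "it", "ar"}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            # The metadata declares every feature with dtype string.
            problems.append(f"field is not a string: {field}")
    language = record.get("language")
    if isinstance(language, str) and language not in LANGUAGE_CODES:
        problems.append(f"unexpected language code: {language}")
    return problems


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "instruction": "How do I fix a leaky tap?",
    "response": "Turn off the water supply first.",
    "language": "en",
    "topic": "home repair",
    "complexity": "basic",
}
print(validate_record(sample))
```

A check like this can be run over a loaded split before training to catch records with missing or mistyped fields.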
Each record includes:

- `instruction`: The user prompt or question
- `response`: The synthetic answer
- `language`: Language code (e.g., 'en', 'fr', 'de', 'es', 'it', 'ar')
- `topic`: General topic (e.g., 'home repair', 'finance')
- `complexity`: One of 'basic', 'intermediate', 'advanced'

## Available Files

- `instructions_en.jsonl`: English (1M records)
- `instructions_fr.jsonl`: French (1000 records)
- `instructions_de.jsonl`: German (1420 records)
- `instructions_es.jsonl`: Spanish (1375 records)
- `instructions_it.jsonl`: Italian (1069 records)
- `instructions_ar.jsonl`: Arabic (1184 records)

## Usage Examples

To load every file in the repository at once, call `load_dataset('Snaseem2026/synthetic-multilingual-instructions')` without `data_files`.

**Load the English dataset:**

```python
from datasets import load_dataset

# data_files selects a single language file from the repository
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_en.jsonl')
print(ds['train'][0])
```

**Load the French dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_fr.jsonl')
print(ds['train'][0])
```

**Load the German dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_de.jsonl')
print(ds['train'][0])
```

**Load the Spanish dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_es.jsonl')
print(ds['train'][0])
```

**Load the Italian dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_it.jsonl')
print(ds['train'][0])
```

**Load the Arabic dataset:**

```python
from datasets import load_dataset

ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_ar.jsonl')
print(ds['train'][0])
```

## Citation

If you use this dataset, please cite:

```
@misc{snaseem2026_synthetic_multilingual_instructions,
  title={Synthetic Multilingual Instruction Dataset},
  author={Snaseem2026},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/Snaseem2026/synthetic-multilingual-instructions}}
}
```
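
## Filtering Records (Sketch)

Because the files are plain JSONL, records can also be processed with the Python standard library alone. The sketch below parses JSONL text and filters by the `language` and `complexity` fields described above. It assumes one JSON object per line. `parse_jsonl` and `filter_records` are hypothetical helper names, and the three sample lines are made up for illustration rather than taken from the dataset.

```python
import json
from collections import Counter

# Made-up JSONL lines for illustration; a real file would be read from disk.
SAMPLE_JSONL = """\
{"instruction": "How do I fix a leaky tap?", "response": "Turn off the water supply first.", "language": "en", "topic": "home repair", "complexity": "basic"}
{"instruction": "How do I rebalance a portfolio?", "response": "Compare current and target allocations.", "language": "en", "topic": "finance", "complexity": "advanced"}
{"instruction": "Comment faire un budget ?", "response": "Listez vos revenus et vos charges.", "language": "fr", "topic": "finance", "complexity": "intermediate"}
"""


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def filter_records(records, language=None, complexity=None):
    """Keep records matching the given language and/or complexity."""
    kept = []
    for record in records:
        if language is not None and record["language"] != language:
            continue
        if complexity is not None and record["complexity"] != complexity:
            continue
        kept.append(record)
    return kept


records = parse_jsonl(SAMPLE_JSONL)
basic_en = filter_records(records, language="en", complexity="basic")
counts = Counter(record["language"] for record in records)

print(len(basic_en))
print(dict(counts))
```

For a downloaded file, the same helpers would apply to the text read from, for example, `instructions_en.jsonl`; for very large files, iterating line by line instead of reading the whole file into memory is likely the better choice.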
Provider:
Snaseem2026