Snaseem2026/synthetic-multilingual-instructions
Hugging Face · Updated 2026-01-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Snaseem2026/synthetic-multilingual-instructions
Description:
---
dataset_name: synthetic-multilingual-instructions
dataset_description: |
  A massive, copyright-free dataset of synthetic instruction–response pairs in English, French, German, Spanish, Italian, and Arabic, generated using open-source LLMs and translation models. Suitable for training and evaluating large language models, chatbots, and multilingual systems.
features:
- name: instruction
  dtype: string
- name: response
  dtype: string
- name: language
  dtype: string
- name: topic
  dtype: string
- name: complexity
  dtype: string
tags:
- text
- multilingual
- synthetic-data
- instruction-following
- jsonl
- open-source
- language-models
- translation
- ai
- dataset
language:
- en
- fr
- de
- es
- it
- ar
formats:
- jsonl
size_categories:
- 1K<n<10K
license: apache-2.0
task_categories:
- text-classification
library:
- datasets
- pandas
- polars
---
# Synthetic Multilingual Instruction Dataset
This dataset contains synthetic, copyright-free instruction–response pairs covering practical, everyday scenarios; per-file record counts are listed under Available Files. Each record includes:
- `instruction`: The user prompt or question
- `response`: The synthetic answer
- `language`: Language code (e.g., 'en', 'fr', 'de', 'es', 'it', 'ar')
- `topic`: General topic (e.g., 'home repair', 'finance')
- `complexity`: One of 'basic', 'intermediate', 'advanced'
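The field list above can be turned into a quick sanity check before training. The following is a minimal sketch, not part of the dataset's tooling: `validate_record` and the constant names are hypothetical, and the allowed values are assumed to be exactly the ones listed above.

```python
from typing import Any

# Field names and allowed values are copied from the field list above.
REQUIRED_FIELDS = ("instruction", "response", "language", "topic", "complexity")
LANGUAGES = {"en", "fr", "de", "es", "it", "ar"}
COMPLEXITIES = {"basic", "intermediate", "advanced"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Every documented field should be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # Only check allowed values when the field is actually present.
    if "language" in record and record["language"] not in LANGUAGES:
        problems.append(f"unexpected language: {record['language']!r}")
    if "complexity" in record and record["complexity"] not in COMPLEXITIES:
        problems.append(f"unexpected complexity: {record['complexity']!r}")
    return problems


# Hypothetical record, invented for illustration; not taken from the dataset.
sample = {
    "instruction": "How do I unclog a kitchen sink?",
    "response": "Start with a plunger, then try a drain snake.",
    "language": "en",
    "topic": "home repair",
    "complexity": "basic",
}
print(validate_record(sample))  # → []
```

Returning a list of problems rather than raising lets a caller log every issue in a file in one pass.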
## Available Files
- `instructions_en.jsonl` — English (1M records)
- `instructions_fr.jsonl` — French (1000 records)
- `instructions_de.jsonl` — German (1420 records)
- `instructions_es.jsonl` — Spanish (1375 records)
- `instructions_it.jsonl` — Italian (1069 records)
- `instructions_ar.jsonl` — Arabic (1184 records)
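If the files listed above have been downloaded locally, they can also be read without any extra library. The sketch below assumes the usual JSONL convention of one JSON object per line; `file_for_language` and `parse_jsonl` are hypothetical helper names, and the sample lines are invented for illustration.

```python
import json
from typing import Iterable

LANGUAGES = ("en", "fr", "de", "es", "it", "ar")


def file_for_language(code: str) -> str:
    """Build the file name for a language code, following the list above."""
    if code not in LANGUAGES:
        raise ValueError(f"no file listed for language code {code!r}")
    return f"instructions_{code}.jsonl"


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse an iterable of JSONL lines (such as an open file), skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Invented sample lines standing in for the contents of a downloaded file.
sample_lines = [
    '{"instruction": "Bonjour ?", "response": "Bonjour !", "language": "fr"}\n',
    "\n",
    '{"instruction": "Merci ?", "response": "De rien.", "language": "fr"}\n',
]
records = parse_jsonl(sample_lines)
print(file_for_language("fr"))  # → instructions_fr.jsonl
print(len(records))             # → 2
```

For a real file, the same function can be applied to an open handle, for example `with open(file_for_language("fr"), encoding="utf-8") as handle: records = parse_jsonl(handle)`.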
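To load all language files together with `load_dataset` from the `datasets` library, omit `data_files`: `ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions')`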
## Usage Examples
**Load the English dataset:**
```python
from datasets import load_dataset
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_en.jsonl')
print(ds['train'][0])
```
**Load the French dataset:**
```python
from datasets import load_dataset
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_fr.jsonl')
print(ds['train'][0])
```
**Load the German dataset:**
```python
from datasets import load_dataset
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_de.jsonl')
print(ds['train'][0])
```
**Load the Spanish dataset:**
```python
from datasets import load_dataset
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_es.jsonl')
print(ds['train'][0])
```
**Load the Italian dataset:**
```python
from datasets import load_dataset
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_it.jsonl')
print(ds['train'][0])
```
**Load the Arabic dataset:**
```python
from datasets import load_dataset
ds = load_dataset('Snaseem2026/synthetic-multilingual-instructions', data_files='instructions_ar.jsonl')
print(ds['train'][0])
```
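Once rows are loaded, a common next step is to subset them by `language` or `complexity`. The sketch below works on any iterable of dict rows (iterating over `ds['train']` yields dicts of this shape); `select` and `count_by_complexity` are hypothetical helpers, and the rows are invented for illustration rather than drawn from the dataset.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional


def select(
    rows: Iterable[dict],
    language: Optional[str] = None,
    complexity: Optional[str] = None,
) -> Iterator[dict]:
    """Yield rows matching every filter that was given (None means no filter)."""
    for row in rows:
        if language is not None and row.get("language") != language:
            continue
        if complexity is not None and row.get("complexity") != complexity:
            continue
        yield row


def count_by_complexity(rows: Iterable[dict]) -> Counter:
    """Count rows per complexity level."""
    return Counter(row.get("complexity") for row in rows)


# Invented rows that follow the documented record fields.
rows = [
    {"instruction": "How do I budget?", "response": "List income and costs.",
     "language": "en", "topic": "finance", "complexity": "basic"},
    {"instruction": "Comment réparer un robinet ?", "response": "Changez le joint.",
     "language": "fr", "topic": "home repair", "complexity": "advanced"},
    {"instruction": "How do bonds work?", "response": "They pay fixed interest.",
     "language": "en", "topic": "finance", "complexity": "advanced"},
]

english_advanced = list(select(rows, language="en", complexity="advanced"))
print(len(english_advanced))                   # → 1
print(count_by_complexity(rows)["advanced"])   # → 2
```

With the `datasets` library, the same condition could instead be passed to `Dataset.filter` as a function of one row.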
## Citation
If you use this dataset, please cite:
```
@misc{snaseem2026_synthetic_multilingual_instructions,
title={Synthetic Multilingual Instruction Dataset},
author={Snaseem2026},
year={2026},
howpublished={\url{https://huggingface.co/datasets/Snaseem2026/synthetic-multilingual-instructions}}
}
```
Provider:
Snaseem2026



