Thanhjash/EduOmni
Source: Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Thanhjash/EduOmni
Description:
---
language:
- vi
- en
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
pretty_name: EduOmni Vietnamese-English Educational Tutor Dataset
size_categories:
- 10K<n<100K
tags:
- education
- vietnamese
- bilingual
- math
- grammar
- STEM
- tutoring
- qwen
- fine-tuning
configs:
- config_name: sft
  data_files:
  - split: train
    path: sft_train.jsonl
- config_name: dpo
  data_files:
  - split: train
    path: dpo_train.jsonl
- config_name: test_metamath
  data_files:
  - split: test
    path: test/metamath_test.jsonl
- config_name: test_scques
  data_files:
  - split: test
    path: test/scques_test.jsonl
- config_name: test_vnhsge
  data_files:
  - split: test
    path: test/vnhsge_test.jsonl
---
# EduOmni: Vietnamese-English Educational Tutor Dataset
<div align="center">
🎓 **37K+ Training Samples** | 📚 **7K DPO Pairs** | 🧪 **2.2K Test Samples**
A comprehensive bilingual (Vietnamese-English) dataset for fine-tuning LLMs as educational tutors
</div>
## 💻 Usage
### ✅ Correct Way to Load
**Use the `name` parameter to select which configuration to load:**
```python
from datasets import load_dataset
# Load SFT training data
sft_dataset = load_dataset("Thanhjash/EduOmni", name="sft", split="train")
# Load DPO training data
dpo_dataset = load_dataset("Thanhjash/EduOmni", name="dpo", split="train")
# Load test sets
test_math = load_dataset("Thanhjash/EduOmni", name="test_metamath", split="test")
test_grammar = load_dataset("Thanhjash/EduOmni", name="test_scques", split="test")
test_stem = load_dataset("Thanhjash/EduOmni", name="test_vnhsge", split="test")
```
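Each configuration in the card's metadata maps to a single JSONL file (for example, `sft` maps to `sft_train.jsonl`), so a downloaded copy of the repository can also be read without the `datasets` library. The following is a minimal sketch under that assumption; `read_jsonl`, `load_config`, and `CONFIG_FILES` are hypothetical helper names, not part of the dataset or of any library.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Config name -> (split, file path), copied from the card's metadata.
CONFIG_FILES = {
    "sft": ("train", "sft_train.jsonl"),
    "dpo": ("train", "dpo_train.jsonl"),
    "test_metamath": ("test", "test/metamath_test.jsonl"),
    "test_scques": ("test", "test/scques_test.jsonl"),
    "test_vnhsge": ("test", "test/vnhsge_test.jsonl"),
}


def load_config(root, name):
    """Load one configuration from a local copy of the repository at `root`."""
    _split, relative_path = CONFIG_FILES[name]
    return read_jsonl(Path(root) / relative_path)
```

With the repository downloaded to a local folder, `load_config("path/to/EduOmni", "sft")` would return the SFT records as plain dictionaries; the folder path here is a placeholder.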
### Format for Training
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct", trust_remote_code=True)
# Format SFT data
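def format_sft(example):
    # Render the whole conversation as one training string via the chat template.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,               # return a string, not token IDs
            add_generation_prompt=False,  # no trailing assistant prompt for training text
        )
    }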
formatted = sft_dataset.map(format_sft, remove_columns=sft_dataset.column_names)
```
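Because `apply_chat_template` expects a list of role/content messages, it can help to screen records before formatting. The sketch below is one possible check, assuming each `messages` entry is a dict with string `role` and `content` fields and that an SFT conversation should end with an assistant turn; `is_valid_conversation` and `ALLOWED_ROLES` are hypothetical names, and the role set is an assumption rather than something the card specifies.

```python
# Assumed role vocabulary; adjust if the data uses other roles.
ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_conversation(example):
    """Return True if example["messages"] looks like a chat-template input."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        if message.get("role") not in ALLOWED_ROLES:
            return False
        if not isinstance(message.get("content"), str):
            return False
    # For supervised fine-tuning, the final turn should be the assistant's answer.
    return messages[-1]["role"] == "assistant"
```

A check like this could be applied with `sft_dataset.filter(is_valid_conversation)` before calling `map(format_sft, ...)`.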
## 📊 Statistics
### SFT Training Data
- **Total**: 37,066 samples
- **Average Length**: 170 tokens
**Distribution:**
- MetaMathQA (Vietnamese): 14,971 (40.4%)
- SC-Ques (Grammar): 10,000 (27.0%)
- Vi-Alpaca (General): 7,999 (21.6%)
- CoEdIT (Editing): 3,000 (8.1%)
- VNHSGE (STEM): 880 (2.4%)
- IELTS Speaking: 216 (0.6%)
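The percentages above follow directly from the per-source counts. As a quick consistency check, the sketch below recomputes them; the counts are copied from the list, while `SFT_COUNTS` and `source_shares` are illustrative names only.

```python
# Per-source sample counts as listed in the distribution above.
SFT_COUNTS = {
    "MetaMathQA (Vietnamese)": 14_971,
    "SC-Ques (Grammar)": 10_000,
    "Vi-Alpaca (General)": 7_999,
    "CoEdIT (Editing)": 3_000,
    "VNHSGE (STEM)": 880,
    "IELTS Speaking": 216,
}


def source_shares(counts):
    """Return each source's share of the total, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(SFT_COUNTS.values())
shares = source_shares(SFT_COUNTS)
for name, share in shares.items():
    print(f"{name}: {SFT_COUNTS[name]:,} ({share}%)")
```

The counts sum to the stated total of 37,066 samples, and math plus grammar data together make up roughly two thirds of the SFT set.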
### DPO Training Data
- **Total**: 7,000 preference pairs
- **Format**: each pair contains a chosen and a rejected response
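The card does not state the column names of the DPO file. The sketch below assumes the common `chosen` / `rejected` naming and shows one way to screen out pairs that carry no preference signal; `is_usable_pair` is a hypothetical helper, and the key names should be adjusted to match the actual columns.

```python
def is_usable_pair(record, chosen_key="chosen", rejected_key="rejected"):
    """Return True if a record holds two distinct, non-empty responses.

    The key names are assumptions; responses may be strings or message lists.
    """
    chosen = record.get(chosen_key)
    rejected = record.get(rejected_key)
    if not chosen or not rejected:
        return False
    # Identical responses give a preference optimizer nothing to learn from.
    return chosen != rejected
```

If the assumed keys match, `dpo_dataset.filter(is_usable_pair)` would keep only pairs with a real contrast between the two responses.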
### Test Sets
- MetaMathQA: 1,000 math problems
- SC-Ques: 1,000 grammar questions
- VNHSGE: 220 STEM exam questions
## 🎓 Use Cases
- Fine-tuning Qwen models for Vietnamese education
- Training bilingual tutoring assistants
- STEM education chatbots
- Grammar correction systems
## ⚖️ License
CC-BY-4.0
## 📚 Sources
- IELTS Speaking, CoEdIT, SC-Ques, MetaMathQA-VI, VNHSGE, Vi-Alpaca, ORPO-DPO Mix
See the full documentation in the repository for detailed attribution.
---
Made with ❤️ for Vietnamese education • Powered by Qwen
Provider:
Thanhjash



