
Thanhjash/EduOmni

Source: Hugging Face | Updated: 2025-12-08 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/Thanhjash/EduOmni
Resource description:
---
language:
- vi
- en
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
pretty_name: EduOmni Vietnamese-English Educational Tutor Dataset
size_categories:
- 10K<n<100K
tags:
- education
- vietnamese
- bilingual
- math
- grammar
- STEM
- tutoring
- qwen
- fine-tuning
configs:
- config_name: sft
  data_files:
  - split: train
    path: sft_train.jsonl
- config_name: dpo
  data_files:
  - split: train
    path: dpo_train.jsonl
- config_name: test_metamath
  data_files:
  - split: test
    path: test/metamath_test.jsonl
- config_name: test_scques
  data_files:
  - split: test
    path: test/scques_test.jsonl
- config_name: test_vnhsge
  data_files:
  - split: test
    path: test/vnhsge_test.jsonl
---

# EduOmni: Vietnamese-English Educational Tutor Dataset

<div align="center">

🎓 **37K+ Training Samples** | 📚 **7K DPO Pairs** | 🧪 **2.2K Test Samples**

A comprehensive bilingual (Vietnamese-English) dataset for fine-tuning LLMs as educational tutors

</div>

## 💻 Usage

### ✅ Correct Way to Load

**Use the `name` parameter to specify which configuration:**

```python
from datasets import load_dataset

# Load SFT training data
sft_dataset = load_dataset("Thanhjash/EduOmni", name="sft", split="train")

# Load DPO training data
dpo_dataset = load_dataset("Thanhjash/EduOmni", name="dpo", split="train")

# Load test sets
test_math = load_dataset("Thanhjash/EduOmni", name="test_metamath", split="test")
test_grammar = load_dataset("Thanhjash/EduOmni", name="test_scques", split="test")
test_stem = load_dataset("Thanhjash/EduOmni", name="test_vnhsge", split="test")
```

### Format for Training

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct", trust_remote_code=True)

# Format SFT data
def format_sft(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
            add_generation_prompt=False
        )
    }

formatted = sft_dataset.map(format_sft, remove_columns=sft_dataset.column_names)
```

## 📊 Statistics

### SFT Training Data

- **Total**: 37,066 samples
- **Average Length**: 170 tokens

**Distribution:**

- MetaMathQA (Vietnamese): 14,971 (40.4%)
- SC-Ques (Grammar): 10,000 (27.0%)
- Vi-Alpaca (General): 7,999 (21.6%)
- CoEdIT (Editing): 3,000 (8.1%)
- VNHSGE (STEM): 880 (2.4%)
- IELTS Speaking: 216 (0.6%)

### DPO Training Data

- **Total**: 7,000 preference pairs
- **Format**: Chosen vs Rejected responses

### Test Sets

- MetaMathQA: 1,000 math problems
- SC-Ques: 1,000 grammar questions
- VNHSGE: 220 STEM exam questions

## 🎓 Use Cases

- Fine-tuning Qwen models for Vietnamese education
- Training bilingual tutoring assistants
- STEM education chatbots
- Grammar correction systems

## ⚖️ License

CC-BY-4.0

## 📚 Sources

- IELTS Speaking, CoEdIT, SC-Ques, MetaMathQA-VI, VNHSGE, Vi-Alpaca, ORPO-DPO Mix

See full documentation in repository for detailed attribution.

---

Made with ❤️ for Vietnamese education • Powered by Qwen
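As a quick sanity check on the figures in the Statistics section, the sketch below recomputes the SFT total, the per-source percentages, and the combined test-set size from the counts listed in the card. It needs no download. The variable and function names (`sft_counts`, `test_counts`, `shares`) are illustrative and not part of the dataset or any library.

```python
# Per-source SFT counts, copied from the "Distribution" list in the card.
sft_counts = {
    "MetaMathQA (Vietnamese)": 14_971,
    "SC-Ques (Grammar)": 10_000,
    "Vi-Alpaca (General)": 7_999,
    "CoEdIT (Editing)": 3_000,
    "VNHSGE (STEM)": 880,
    "IELTS Speaking": 216,
}

# Test-set sizes, copied from the "Test Sets" list in the card.
test_counts = {
    "MetaMathQA": 1_000,
    "SC-Ques": 1_000,
    "VNHSGE": 220,
}


def shares(counts):
    """Return each entry's share of the total, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


sft_total = sum(sft_counts.values())
sft_shares = shares(sft_counts)
test_total = sum(test_counts.values())

if __name__ == "__main__":
    print(f"SFT total: {sft_total:,}")
    for name, pct in sft_shares.items():
        print(f"  {name}: {sft_counts[name]:,} ({pct}%)")
    print(f"Test total: {test_total:,}")
```

The per-source counts sum to the stated 37,066 samples, the rounded shares match the percentages in the list, and the three test sets together hold 2,220 samples, consistent with the "2.2K Test Samples" headline.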
Provider:
Thanhjash