Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün Türkçe Veri Seti
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13375017
Resource description:
Important Note: An error was found in the answer column of this dataset and has been fixed in the latest version. It is therefore very important to use the latest version.
Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün Türkçe Veri Seti (Turkish MMLU: The Most Comprehensive and Original Turkish Dataset for AI and Academic Applications)
Turkish MMLU is an original resource designed for training, fine-tuning, and evaluating AI models in Turkish. With 293,468 questions, it is presented as the most extensive collection in its field, covering a wide range of academic and professional subjects relevant to Turkey, including major exams such as TUS (Medical Specialization Examination), KPSS (Public Personnel Selection Examination), and many others.
Key Features:
• Completely Original Content: The dataset is entirely created from original Turkish sources, ensuring authenticity and relevance to the Turkish context. It has not been translated from other languages, which is crucial for maintaining the integrity of the language data.
• Extensive Data Volume: With nearly 300,000 questions, the dataset offers a substantial corpus for training models, enabling deep learning algorithms to gain a nuanced understanding of the Turkish language across diverse topics.
• Detailed Structure: The dataset is organized into six key columns:
    • ‘bölüm’ (section): Indicates the broader exam or category.
    • ‘konu’ (subject): Specifies the topic within the section.
    • ‘soru’ (question): The question text itself.
    • ‘cevap’ (answer): The correct answer to the question.
    • ‘aciklama’ (explanation): Provides additional context or reasoning for the answer, crucial for models to understand the logic behind correct responses.
    • ‘secenekler’ (options): The possible answer choices, essential for multiple-choice formats.
• Wide Range of Sections and Subjects: The dataset includes 67 sections covering over 800 unique subjects. These sections span from specialized medical fields in TUS to general knowledge and vocational exams like KPSS and Ehliyet, ensuring that the dataset reflects the complexity and breadth of Turkish academic and professional content.
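The six-column layout described above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the author's tooling. It assumes the data is available as a UTF-8 CSV file whose header uses the six column names exactly as listed; the description does not state the file format. The rows in `SAMPLE_CSV` and the `|` separator inside ‘secenekler’ are made up for illustration.

```python
import csv
import io
from collections import Counter

# The six column names listed in the dataset description.
EXPECTED_COLUMNS = ["bölüm", "konu", "soru", "cevap", "aciklama", "secenekler"]

# Made-up rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = """bölüm,konu,soru,cevap,aciklama,secenekler
TUS,Farmakoloji,Örnek soru 1?,A,Örnek açıklama 1,A) bir | B) iki
TUS,Patoloji,Örnek soru 2?,B,Örnek açıklama 2,A) bir | B) iki
KPSS,Tarih,Örnek soru 3?,A,Örnek açıklama 3,A) bir | B) iki
"""


def load_rows(handle):
    """Read rows from a CSV file object and check that all six columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def summarize(rows):
    """Count questions per section and distinct subjects per section."""
    questions = Counter(row["bölüm"] for row in rows)
    subjects = {}
    for row in rows:
        subjects.setdefault(row["bölüm"], set()).add(row["konu"])
    return questions, {section: len(s) for section, s in subjects.items()}


rows = load_rows(io.StringIO(SAMPLE_CSV))
questions_per_section, subjects_per_section = summarize(rows)
print(questions_per_section)
print(subjects_per_section)
```

To run the same check against a real download, replace `io.StringIO(SAMPLE_CSV)` with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was saved. The counts can then be compared with the 67 sections and 800+ subjects stated in the description.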
Dataset Source and Usage:
• Data Source: The dataset is compiled from publicly available data on the internet. While care has been taken to ensure that the data is original, there may be instances where some questions contain copyrighted material. If any copyright holders identify their material within the dataset, they are encouraged to contact the author, and the specific question will be promptly removed.
• Non-Commercial Use: This dataset is strictly intended for research and academic purposes. It cannot be used for commercial purposes under any circumstances.
Importance for AI Models:
1. Training: The vast number of questions, coupled with detailed explanations, makes this dataset an invaluable resource for training AI models to understand and process Turkish at a high level. The diversity of topics also ensures that the model is exposed to a wide range of vocabulary, concepts, and linguistic structures.
2. Fine-Tuning: For researchers and developers looking to fine-tune existing models, such as GPT, BERT, or other transformer-based architectures, this dataset offers domain-specific content that can significantly enhance performance in areas like medical language processing, legal text analysis, or general-purpose Turkish language understanding.
3. Evaluation: The Turkish MMLU dataset is ideal for evaluating the performance of AI models in Turkish. With its rich content and structured format, it allows for rigorous testing across various subjects, helping to measure how well a model can comprehend and generate accurate responses in Turkish.
4. Real-World Application: Beyond academic research, this dataset is also highly applicable in developing AI-powered tools for exam preparation, automated tutoring systems, and educational applications that require a deep understanding of the Turkish language and its diverse domains.
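The evaluation use described above can be sketched as a per-section accuracy calculation. This is one possible approach under stated assumptions, not a benchmark protocol defined by the dataset. The function name `accuracy_by_section`, the `predict` callback, and the sample rows are hypothetical. The sketch also assumes ‘cevap’ holds a short answer label that can be compared directly with a model's output.

```python
from collections import defaultdict


def accuracy_by_section(rows, predict):
    """Return {section: fraction of rows where predict(row) equals 'cevap'}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        section = row["bölüm"]
        total[section] += 1
        # Exact match after trimming surrounding whitespace.
        if predict(row).strip() == row["cevap"].strip():
            correct[section] += 1
    return {section: correct[section] / total[section] for section in total}


# Made-up rows for illustration only; they are NOT taken from the dataset.
rows = [
    {"bölüm": "TUS", "soru": "Örnek soru 1?", "secenekler": "A) bir | B) iki", "cevap": "A"},
    {"bölüm": "TUS", "soru": "Örnek soru 2?", "secenekler": "A) bir | B) iki", "cevap": "B"},
    {"bölüm": "KPSS", "soru": "Örnek soru 3?", "secenekler": "A) bir | B) iki", "cevap": "A"},
]


def always_a(row):
    """Placeholder baseline that ignores the question and always answers 'A'."""
    return "A"


scores = accuracy_by_section(rows, always_a)
print(scores)
```

A real model would replace `always_a` with a function that builds a prompt from ‘soru’ and ‘secenekler’ and returns the model's chosen answer. The comparison deliberately avoids `str.lower()`, because Python's default case mapping does not follow Turkish rules for I and ı.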
Example Sections:
• Medical Exams (TUS): Includes specialized subjects such as Farmakoloji, Patoloji, Mikrobiyoloji, and more, which are critical for training models intended for medical documentation or decision support systems.
• Public and Professional Exams (KPSS): Encompasses a wide array of subjects like Genel Kültür, Tarih, Coğrafya, and Vatandaşlık, making it valuable for general-purpose models.
• Diverse Topics: Ranging from Dini Bilgiler and Futbol to İlahiyat and İşletme, this dataset provides a robust foundation for models that need to handle a variety of real-world questions in Turkish.
Potential Uses:
• Model Training: Utilize the dataset to train AI models from scratch, providing a foundational understanding of Turkish in both general and specialized contexts.
• Fine-Tuning Pre-Trained Models: Enhance existing models by fine-tuning them on this dataset, allowing them to achieve better performance in Turkish language tasks.
• Evaluation and Benchmarking: Test and benchmark the capabilities of AI models, ensuring they meet the necessary standards for comprehension and response generation in Turkish.
• AI-Powered Educational Tools: Develop intelligent tutoring systems or exam preparation tools that can assist students and professionals in mastering complex subjects.
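For the fine-tuning use listed above, each row has to be turned into some text format a model can train on. The sketch below shows one hypothetical prompt/completion layout built from the six columns. The field labels, the layout, and the sample row are assumptions made for illustration; the dataset description does not prescribe a format, and ‘secenekler’ is passed through as an opaque string because its internal structure is not documented here.

```python
import json


def to_training_record(row):
    """Turn one row into a prompt/completion pair (a hypothetical layout)."""
    prompt = (
        f"Bölüm: {row['bölüm']}\n"
        f"Konu: {row['konu']}\n"
        f"Soru: {row['soru']}\n"
        f"Seçenekler: {row['secenekler']}\n"
        "Cevap:"
    )
    # The explanation follows the answer so the model sees the reasoning too.
    completion = f" {row['cevap']}\nAçıklama: {row['aciklama']}"
    return {"prompt": prompt, "completion": completion}


# Made-up row for illustration only; it is NOT taken from the dataset.
row = {
    "bölüm": "KPSS",
    "konu": "Tarih",
    "soru": "Örnek soru?",
    "cevap": "A",
    "aciklama": "Örnek açıklama.",
    "secenekler": "A) bir | B) iki",
}

record = to_training_record(row)
# ensure_ascii=False keeps Turkish characters readable in the JSON line.
print(json.dumps(record, ensure_ascii=False))
```

Writing one such JSON object per line gives a JSON Lines file, a layout many fine-tuning pipelines accept. The key names they expect vary, so `prompt` and `completion` may need renaming.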
Conclusion:
The Turkish MMLU dataset is not just a collection of questions and answers; it is a comprehensive and original tool designed to advance the development of AI in the Turkish language. Whether you are training new models, fine-tuning existing ones, or evaluating their performance, this dataset offers the depth and breadth needed to push the boundaries of natural language processing in Turkish. Its originality and extensive scope make it an indispensable resource for anyone working in this field.
Created: 2024-08-27



