khursani8/cuti
Hugging Face | Updated 2025-12-18 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/khursani8/cuti
Overview:
MALAYMMLU is a Malaysian educational pretraining dataset containing 9,429 entries and 3,187,502 words. It is designed for training language models on Malaysian educational content across STEM and humanities subjects. It covers 25 educational subjects, with Bahasa Melayu as the primary language and English as the secondary language. The content is organized into three stages: Atomic Knowledge (12.7%), Comprehensive Materials (40.1%), and Complex Sentences (23.2%). Each entry contains the text, subject, stage, content type, word count, source ID, and generation timestamp. The dataset is split into training (80%), validation (10%), and test (10%) sets.
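As a rough illustration of the record layout and split proportions described above, the following minimal sketch models one entry and derives approximate split sizes from the stated totals. The field names (`content_type`, `source_id`, `generated_at`, and so on) and the rounding rule are assumptions made for illustration; the actual column names and exact split counts are not given in this listing.

```python
from dataclasses import dataclass

# Hypothetical record layout: the listing names these fields,
# but the exact column names and types are assumptions.
@dataclass
class Entry:
    text: str
    subject: str
    stage: str
    content_type: str
    word_count: int
    source_id: str
    generated_at: str  # generation timestamp, format unknown

# Totals as stated in the dataset description.
TOTAL_ENTRIES = 9_429
TOTAL_WORDS = 3_187_502

def split_sizes(n, train_frac=0.8, val_frac=0.1):
    """Approximate 80/10/10 split sizes; the remainder goes to test.

    The rounding rule (floor for train and validation) is an assumption.
    """
    train = int(n * train_frac)
    val = int(n * val_frac)
    test = n - train - val
    return train, val, test

train_n, val_n, test_n = split_sizes(TOTAL_ENTRIES)
avg_words = TOTAL_WORDS / TOTAL_ENTRIES  # mean words per entry

print(train_n, val_n, test_n)
print(round(avg_words, 2))
```

Under these assumptions the three split sizes always sum to the stated 9,429 entries, and the stated totals imply a mean entry length of roughly 338 words.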
Provider:
khursani8



