five

UMxYTLAILabs/MalayMMLU

收藏
Hugging Face2024-12-12 更新2025-04-12 收录
下载链接:
https://hf-mirror.com/datasets/UMxYTLAILabs/MalayMMLU
下载链接
链接失效反馈
官方服务:
资源简介:
--- task_categories: - question-answering language: - ms tags: - knowledge pretty_name: MalayMMLU size_categories: - 10K<n<100K configs: - config_name: default data_files: - split: eval path: - "MalayMMLU_0shot.json" - "MalayMMLU_1shot.json" - "MalayMMLU_2shot.json" - "MalayMMLU_3shot.json" --- # MalayMMLU Released on September 27, 2024 <h4 align="center"> <p> <b href="https://huggingface.co/datasets/UMxYTLAILabs/MalayMMLU">English</b> | <a href="https://huggingface.co/datasets/UMxYTLAILabs/MalayMMLU/blob/main/README_ms.md">Bahasa Melayu</a> <p> <p align="center" style="display: flex; flex-direction: row; justify-content: center; align-items: center"> 📄 <a href="https://github.com/UMxYTL-AI-Labs/MalayMMLU/blob/main/MalayMMLU_paper.pdf" target="_blank" style="margin-right: 15px; margin-left: 10px">Paper</a> • <!-- 🤗 <a href="https://huggingface.co/datasets/UMxYTLAILabs/MalayMMLU" target="_blank" style="margin-left: 10px;margin-right: 10px">Dataset</a> • --> <img src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" alt="GitHub logo" style="width: 25px; height: 25px;margin-left: 5px;margin-right: 10px"><a href="https://github.com/UMxYTL-AI-Labs/MalayMMLU" target="_blank" style="margin-right: 15px;">Code</a> • 📜 <a href="https://huggingface.co/datasets/UMxYTLAILabs/MalayMMLU/blob/main/MalayMMLU_Poster.pdf" target="_blank" style="margin-left: 10px">Poster</a> </h4> ## Introduction MalayMMLU is the first multitask language understanding (MLU) for Malay Language. The benchmark comprises 24,213 questions spanning both primary (Year 1-6) and secondary (Form 1-5) education levels in Malaysia, encompassing 5 broad topics that further divide into 22 subjects. <p align="center"> <img src="imgs/MalayMMLU.png" width="250" > </p> | **Category** | **Subjects** | |----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | **STEM** | Computer Science (Secondary), Biology (Secondary), Chemistry (Secondary), Computer Literacy (Secondary), Mathematics (Primary, Secondary), Additional Mathematics (Secondary), Design and Technology (Primary, Secondary), Core Science (Primary, Secondary), Information and Communication Technology (Primary), Automotive Technology (Secondary) | | **Language** | Malay Language (Primary, Secondary) | | **Social science** | Geography (Secondary), Local Studies (Primary), History (Primary, Secondary) | | **Others** | Life Skills (Primary, Secondary), Principles of Accounting (Secondary), Economics (Secondary), Business (Secondary), Agriculture (Secondary) | | **Humanities** | Quran and Sunnah (Secondary), Islam (Primary, Secondary), Sports Science Knowledge (Secondary) | ## Result #### Zero-shot results of LLMs on MalayMMLU (First token accuracy) <table> <thead> <tr> <th rowspan="2">Organization</th> <th rowspan="2">Model</th> <th rowspan="2">Vision</th> <th colspan="7">Acc.</th> </tr> <tr> <th>Language</th> <th>Humanities</th> <th>STEM</th> <th>Social Science</th> <th>Others</th> <th>Average</th> </tr> </thead> <tbody> <tr> <td></td> <td>Random</td> <td></td> <td>38.01</td> <td>42.09</td> <td>36.31</td> <td>36.01</td> <td>38.07</td> <td>38.02</td> </tr> <tr> <td>YTL</td> <td style="font-family: sans-serif;">Ilmu 0.1</td> <td></td> <td><strong>87.77</strong></td> <td><strong>89.26</strong></td> <td><strong>86.66</strong></td> <td><strong>85.27</strong></td> <td><strong>86.40</strong></td> <td><strong>86.98</strong></td> </tr> <tr> <td rowspan="4">OpenAI</td> <td>GPT-4o</td> <td style="color: green; text-align: center"><b>&#10003</b></td> <td><ins>87.12</ins></td> <td><ins>88.12</ins></td> <td><ins>83.83</ins></td> <td><ins>82.58</ins></td> <td><ins>83.09</ins></td> <td><ins>84.98</ins></td> </tr> <tr> <td>GPT-4</td> <td style="color: green; text-align: center"><b>&#10003</b></td> <td>82.90</td> <td>83.91</td> <td>78.80</td> <td>77.29</td> <td>77.33</td> <td>80.11</td> </tr> <tr> <td>GPT-4o mini</td> <td style="color: green; text-align: center"><b>&#10003</b></td> <td>82.03</td> <td>81.50</td> <td>78.51</td> <td>75.67</td> <td>76.30</td> <td>78.78</td> </tr> <tr> <td>GPT-3.5</td> <td></td> <td>69.62</td> <td>71.01</td> <td>67.17</td> <td>66.70</td> <td>63.73</td> <td>67.78</td> </tr> <tr> <td rowspan="8">Meta</td> <td>LLaMA-3.1 (70B)</td> <td></td> <td>78.75</td> <td>82.59</td> <td>78.96</td> <td>77.20</td> <td>75.32</td> <td>78.44</td> </tr> <tr> <td>LLaMA-3.3 (70B)</td> <td></td> <td>78.82</td> <td>80.46</td> <td>78.71</td> <td>75.79</td> <td>73.85</td> <td>77.38</td> </tr> <tr> <td>LLaMA-3.1 (8B)</td> <td></td> <td>65.47</td> <td>67.17</td> <td>64.10</td> <td>62.59</td> <td>62.13</td> <td>64.24</td> </tr> <tr> <td>LLaMA-3 (8B)</td> <td></td> <td>63.93</td> <td>66.21</td> <td>62.26</td> <td>62.97</td> <td>61.38</td> <td>63.46</td> </tr> <tr> <td>LLaMA-2 (13B)</td> <td></td> <td>45.58</td> <td>50.72</td> <td>44.13</td> <td>44.55</td> <td>40.87</td> <td>45.26</td> </tr> <tr> <td>LLaMA-2 (7B)</td> <td></td> <td>47.47</td> <td>52.74</td> <td>48.71</td> <td>50.72</td> <td>48.19</td> <td>49.61</td> </tr> <tr> <td>LLaMA-3.2 (3B)</td> <td></td> <td>58.52</td> <td>60.66</td> <td>56.65</td> <td>54.06</td> <td>52.75</td> <td>56.45</td> </tr> <tr> <td>LLaMA-3.2 (1B)</td> <td></td> <td>38.88</td> <td>43.30</td> <td>40.65</td> <td>40.56</td> <td>39.55</td> <td>40.46</td> </tr> <tr> <td rowspan="8">Qwen (Alibaba)</td> <td>Qwen 2.5 (72B)</td> <td></td> <td>79.09</td> <td>79.95</td> <td>80.88</td> <td>75.80</td> <td>75.05</td> <td>77.79</td> </tr> <tr> <td>Qwen-2.5 (32B)</td> <td></td> <td>76.96</td> <td>76.70</td> <td>79.74</td> <td>72.35</td> <td>70.88</td> <td>74.83</td> </tr> <tr> <td>Qwen-2-VL (7B)</td> <td style="color: green; text-align: center"><b>&#10003</b></td> <td>68.16</td> <td>63.62</td> <td>67.58</td> <td>60.38</td> <td>59.08</td> <td>63.49</td> </tr> <tr> <td>Qwen-2-VL (2B)</td> <td style="color: green; text-align: center"><b>&#10003</b></td> <td>58.22</td> <td>55.56</td> <td>57.51</td> <td>53.67</td> <td>55.10</td> <td>55.83</td> </tr> <tr> <td>Qwen-1.5 (14B)</td> <td></td> <td>64.47</td> <td>60.64</td> <td>61.97</td> <td>57.66</td> <td>58.05</td> <td>60.47</td> </tr> <tr> <td>Qwen-1.5 (7B)</td> <td></td> <td>60.13</td> <td>59.14</td> <td>58.62</td> <td>54.26</td> <td>54.67</td> <td>57.18</td> </tr> <tr> <td>Qwen-1.5 (4B)</td> <td></td> <td>48.39</td> <td>52.01</td> <td>51.37</td> <td>50.00</td> <td>49.10</td> <td>49.93</td> </tr> <tr> <td>Qwen-1.5 (1.8B)</td> <td></td> <td>42.70</td> <td>43.37</td> <td>43.68</td> <td>43.12</td> <td>44.42</td> <td>43.34</td> </tr> <tr> <td rowspan="5">Zhipu</td> <td>GLM-4-Plus</td> <td></td> <td>78.04</td> <td>75.63</td> <td>77.49</td> <td>74.07</td> <td>72.66</td> <td>75.48</td> </tr> <tr> <td>GLM-4-Air</td> <td></td> <td>67.88</td> <td>69.56</td> <td>70.20</td> <td>66.06</td> <td>66.18</td> <td>67.60</td> </tr> <tr> <td>GLM-4-Flash</td> <td></td> <td>63.52</td> <td>65.69</td> <td>66.31</td> <td>63.21</td> <td>63.59</td> <td>64.12</td> </tr> <tr> <td>GLM-4</td> <td></td> <td>63.39</td> <td>56.72</td> <td>54.40</td> <td>57.24</td> <td>55.00</td> <td>58.07</td> </tr> <tr> <td>GLM-4<sup>††</sup> (9B)</td> <td></td> <td>58.51</td> <td>60.48</td> <td>56.32</td> <td>55.04</td> <td>53.97</td> <td>56.87</td> </tr> <tr> <td rowspan="3">Google</td> <td>Gemma-2 (9B)</td> <td></td> <td>75.83</td> <td>72.83</td> <td>75.07</td> <td>69.72</td> <td>70.33</td> <td>72.51</td> </tr> <tr> <td>Gemma (7B)</td> <td></td> <td>45.53</td> <td>50.92</td> <td>46.13</td> <td>47.33</td> <td>46.27</td> <td>47.21</td> </tr> <tr> <td>Gemma (2B)</td> <td></td> <td>46.50</td> <td>51.15</td> <td>49.20</td> <td>48.06</td> <td>48.79</td> <td>48.46</td> </tr> <tr> <td rowspan="2">SAIL (Sea)</td> <td>Sailor<sup>†</sup> (14B)</td> <td></td> <td>78.40</td> <td>72.88</td> <td>69.63</td> <td>69.47</td> <td>68.67</td> <td>72.29</td> </tr> <tr> <td>Sailor<sup>†</sup> (7B)</td> <td></td> <td>74.54</td> <td>68.62</td> <td>62.79</td> <td>64.69</td> <td>63.61</td> <td>67.58</td> </tr> <tr> <td rowspan="3">Mesolitica</td> <td>MaLLaM-v2.5 Small<sup>‡</sup></td> <td></td> <td>73.00</td> <td>71.00</td> <td>70.00</td> <td>72.00</td> <td>70.00</td> <td>71.53</td> </tr> <td>MaLLaM-v2.5 Tiny<sup>‡</sup></td> <td></td> <td>67.00</td> <td>66.00</td> <td>68.00</td> <td>69.00</td> <td>66.00</td> <td>67.32</td> </tr> <td>MaLLaM-v2<sup>†</sup> (5B)</td> <td></td> <td>42.57</td> <td>46.44</td> <td>42.24</td> <td>40.82</td> <td>38.74</td> <td>42.08</td> </tr> <tr> <td>Cohere for AI</td> <td>Command R (32B)</td> <td></td> <td>71.68</td> <td>71.49</td> <td>66.68</td> <td>67.19</td> <td>63.64</td> <td>68.47</td> </tr> <tr> <td>OpenGVLab</td> <td>InternVL2 (40B)</td> <td style="color: green; text-align: center"><b>&#10003</b></td> <td>70.36</td> <td>68.49</td> <td>64.88</td> <td>65.93</td> <td>60.54</td> <td>66.51</td> </tr> <tr> <td>Damo (Alibaba)</td> <td>SeaLLM-v2.5<sup>†</sup> (7B)</td> <td></td> <td>69.75</td> <td>67.94</td> <td>65.29</td> <td>62.66</td> <td>63.61</td> <td>65.89</td> </tr> <tr> <td rowspan="4">Mistral</td> <td>Pixtral (12B)</td> <td style="color: green; text-align: center"><b>&#10003</b></td> <td>64.81</td> <td>62.68</td> <td>64.72</td> <td>63.93</td> <td>59.49</td> <td>63.25</td> </tr> <tr> <td>Mistral Small (22B)</td> <td></td> <td>65.19</td> <td>65.03</td> <td>63.36</td> <td>61.58</td> <td>59.99</td> <td>63.05</td> </tr> <tr> <td>Mistral-v0.3 (7B)</td> <td></td> <td>56.97</td> <td>59.29</td> <td>57.14</td> <td>58.28</td> <td>56.56</td> <td>57.71</td> </tr> <tr> <td>Mistral-v0.2 (7B)</td> <td></td> <td>56.23</td> <td>59.86</td> <td>57.10</td> <td>56.65</td> <td>55.22</td> <td>56.92</td> </tr> <tr> <td rowspan="2">Microsoft</td> <td>Phi-3 (14B)</td> <td></td> <td>60.07</td> <td>58.89</td> <td>60.91</td> <td>58.73</td> <td>55.24</td> <td>58.72</td> </tr> <tr> <td>Phi-3 (3.8B)</td> <td></td> <td>52.24</td> <td>55.52</td> <td>54.81</td> <td>53.70</td> <td>51.74</td> <td>53.43</td> </tr> <tr> <td>01.AI</td> <td>Yi-1.5 (9B)</td> <td></td> <td>56.20</td> <td>53.36</td> <td>57.47</td> <td>50.53</td> <td>49.75</td> <td>53.08</td> </tr> <tr> <td rowspan="2">Stability AI</td> <td>StableLM 2 (12B)</td> <td></td> <td>53.40</td> <td>54.84</td> <td>51.45</td> <td>51.79</td> <td>50.16</td> <td>52.45</td> </tr> <tr> <td>StableLM 2 (1.6B)</td> <td></td> <td>43.92</td> <td>51.10</td> <td>45.27</td> <td>46.14</td> <td>46.75</td> <td>46.48</td> </tr> <tr> <td>Baichuan</td> <td>Baichuan-2 (7B)</td> <td></td> <td>40.41</td> <td>47.35</td> <td>44.37</td> <td>46.33</td> <td>43.54</td> <td>44.30</td> </tr> <tr> <td>Yellow.ai</td> <td>Komodo<sup>†</sup> (7B)</td> <td></td> <td>43.62</td> <td>45.53</td> <td>39.34</td> <td>39.75</td> <td>39.48</td> <td>41.72</td> </tr> </tbody> </table> Highest scores are <strong>bolded</strong> and second highest scores are <ins>underlined</ins>. † denotes LLMs fine-tuned with Southeast Asia datasets. †† denotes open-source GLM-4. ‡ result from https://mesolitica.com/mallam. ## Citation ```bibtex @InProceedings{MalayMMLU2024, author = {Poh, Soon Chang and Yang, Sze Jue and Tan, Jeraelyn Ming Li and Chieng, Lawrence Leroy Tze Yao and Tan, Jia Xuan and Yu, Zhenyu and Foong, Chee Mun and Chan, Chee Seng }, title = {MalayMMLU: A Multitask Benchmark for the Low-Resource Malay Language}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024}, month = {November}, year = {2024}, } ``` ## Feedback Suggestions and opinions (both positive and negative) are greatly welcome. Please contact the author by sending email to `cs.chan at um.edu.my`.
提供机构:
UMxYTLAILabs
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作