MMLU-SemiPro

Name: MMLU-SemiPro
Creator: maas
Published: 2025-12-03 17:20:49
License: No description available

ModelScope community. Updated: 2025-12-03. Indexed: 2024-12-21.

Download link:

https://modelscope.cn/datasets/answerdotai/MMLU-SemiPro


Description:

This dataset is derived from [TIGER-Lab/MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) as part of our MMLU-Leagues encoder benchmark series, which contains:

- [MMLU-Amateur](https://huggingface.co/datasets/answerdotai/MMLU-Amateur), where the train set contains all questions that Llama-3-8B-Instruct (5-shot) gets wrong and the test set contains all questions it gets right. The aim is to measure the ability of an encoder, with relatively limited training data, to match the performance of a small frontier model.
- **MMLU-SemiPro** (this dataset), where the data is split evenly between a train and a test set. Both splits contain the same proportion of questions that Llama-3-8B-Instruct (5-shot) answers correctly, to ensure an even difficulty distribution. The split is also stratified by category, so that both splits contain the same number (+/-) of questions from each category.

This dataset was processed with the following script:

```python
import srsly
from datasets import Dataset, load_dataset
# Note: these two splitters are not used in the excerpt shown here
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Load the original MMLU-Pro test split
data_df = load_dataset("TIGER-Lab/MMLU-Pro", split="test").to_pandas()

# Load llama cached predictions
# You can get the llama outputs from https://github.com/TIGER-AI-Lab/MMLU-Pro/blob/main/eval_results/model_outputs_Meta-Llama-3-8B-Instruct_5shots.json
llama_outputs = srsly.read_json('llm_outputs/model_outputs_Meta-Llama-3-8B-Instruct_5shots.json')

# Enrich the df with the llama predictions
llama_pred_dict = {item['question_id']: item['pred'] for item in llama_outputs}
data_df['llama_pred'] = data_df['question_id'].map(llama_pred_dict)
# Questions without a cached prediction count as unanswered (and therefore incorrect)
data_df['llama_pred'] = data_df['llama_pred'].fillna("NoAnswer")
data_df['llama_correct'] = data_df['llama_pred'] == data_df['answer']
data_df = data_df.reset_index(drop=True)

# Filter down to only questions with exactly 10 answers
data_df = data_df[data_df["options"].apply(len) == 10].copy()
data_df = data_df.reset_index(drop=True)


# train-test split
def add_fold(df, group_col="category", fold_method="semipro"):
    if fold_method not in ["amateur", "semipro"]:
        raise ValueError("fold_method must be either 'amateur' or 'semipro'")
    if fold_method == "amateur":
        # Fold 1 (test) = questions Llama answers correctly, fold 0 (train) = the rest
        df["kfold"] = df["llama_correct"].astype(int)
        return df
    # truncated ...
    return df


amateur_processed_df = add_fold(data_df, fold_method="amateur")
amateur_test_df = amateur_processed_df[amateur_processed_df["kfold"] == 1].drop(columns="kfold")
amateur_train_df = amateur_processed_df[amateur_processed_df["kfold"] == 0].drop(columns="kfold")

amateur_train_ds = Dataset.from_pandas(amateur_train_df, preserve_index=False)
amateur_test_ds = Dataset.from_pandas(amateur_test_df, preserve_index=False)

# Sanity check: Ensure all llama_correct == True are in test, and all llama_correct == False are in train
test_correct = amateur_test_df['llama_correct'].all()
train_incorrect = (~amateur_train_df['llama_correct']).all()
assert test_correct, "Not all examples in the test set have llama_correct == True"
assert train_incorrect, "Not all examples in the train set have llama_correct == False"
print("Sanity check passed: All llama_correct == True are in test, and all llama_correct == False are in train.")
```
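The SemiPro branch of `add_fold` is truncated in the script, so the authors' exact procedure is not shown here. Purely as an illustration of the properties described (an even train/test split with matching proportions of Llama-correct questions and matching category counts), the sketch below stratifies a 50/50 split on the combination of `category` and `llama_correct` using scikit-learn's `train_test_split`. The helper name `semipro_style_split`, the `seed` parameter, and the toy data are assumptions made for this example, not part of the original pipeline, which may well use a different splitter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def semipro_style_split(df, seed=0):
    """Hypothetical helper: 50/50 split stratified jointly on category and llama_correct."""
    # Build one stratification key per row, e.g. "math|True"
    strata = df["category"].astype(str) + "|" + df["llama_correct"].astype(str)
    train_df, test_df = train_test_split(
        df, test_size=0.5, stratify=strata, random_state=seed
    )
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)


# Toy stand-in for the enriched MMLU-Pro frame (made-up values, not real data)
toy_df = pd.DataFrame({
    "question_id": range(40),
    "category": ["math"] * 20 + ["law"] * 20,
    "llama_correct": ([True] * 8 + [False] * 12) + ([True] * 4 + [False] * 16),
})

train_df, test_df = semipro_style_split(toy_df)

# Compare difficulty and category balance across the two splits
print(train_df["llama_correct"].mean(), test_df["llama_correct"].mean())
print(train_df["category"].value_counts().to_dict())
print(test_df["category"].value_counts().to_dict())
```

Stratifying on a combined key is one simple way to balance both properties at once; with very small categories, a stratum containing a single question would cause `train_test_split` to raise an error, so real data may need such strata merged or handled separately.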


Provider:

maas

Created:

2024-12-20
