Meta-Llama-3.1-405B-Instruct-evals
ModelScope Community. Updated 2026-01-06. Indexed 2024-10-12.
Download link:
https://modelscope.cn/datasets/LLM-Research/Meta-Llama-3.1-405B-Instruct-evals
Description:
# Dataset Card for Llama-3.1-405B-Instruct Evaluation Result Details
<!-- Provide a quick summary of the dataset. -->
This dataset contains the Meta evaluation result details for **Llama-3.1-405B-Instruct**. The dataset has been created from 30 evaluation tasks. These tasks are human_eval, gorilla_api_bench__huggingface, mmlu_pro, infinite_bench, api_bank, human_eval_plus, ifeval__loose, mmlu__0_shot__cot, nih__multi_needle, multilingual_mmlu_de, mmlu, gsm8k, mgsm, multilingual_mmlu_fr, multilingual_mmlu_pt, math_hard, multilingual_mmlu_es, gpqa, zero_shot_scrolls, ifeval__strict, nexus, math, mbpp_plus, multilingual_mmlu_th, arc_challenge, multilingual_mmlu_hi, gorilla_api_bench__torchub, multilingual_mmlu_it, gorilla_api_bench__tensorhub, mbpp.
The details for each task are stored as a separate subset in each configuration. Each subset is named using the task name plus the timestamp of the upload time, and ends with "__details".
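As a rough illustration of this naming convention, the sketch below recovers the task name from a subset name. It assumes the pattern `<dataset name>__<task>__details`, as seen in the subset name `Llama-3.1-405B-Instruct-evals__mbpp__details` used on this card. The helper `task_from_subset` and both constants are hypothetical, and subset names that carry an upload timestamp may need extra handling.

```python
from typing import Optional

# Assumed from the example subset name on this card; not an official constant.
DATASET_PREFIX = "Llama-3.1-405B-Instruct-evals__"
DETAILS_SUFFIX = "__details"


def task_from_subset(subset_name: str) -> Optional[str]:
    """Return the task name embedded in a details subset name, or None if it does not match."""
    if len(subset_name) <= len(DATASET_PREFIX) + len(DETAILS_SUFFIX):
        return None
    if not subset_name.startswith(DATASET_PREFIX):
        return None
    if not subset_name.endswith(DETAILS_SUFFIX):
        return None
    # Slice instead of splitting on "__": task names such as "ifeval__loose" contain "__" too.
    return subset_name[len(DATASET_PREFIX):-len(DETAILS_SUFFIX)]


print(task_from_subset("Llama-3.1-405B-Instruct-evals__mbpp__details"))
```

Slicing off a fixed prefix and suffix keeps task names with double underscores, such as `gorilla_api_bench__huggingface`, intact.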
For more information about the eval tasks, please refer to this [eval details](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md) page.
You can use the Viewer feature to browse the dataset in a web browser. For most tasks, we provide an "**is_correct**" column, so you can quickly get our accuracy result for a task by checking the percentage of rows with "is_correct=True". For tasks that have both a binary metric (e.g., exact_match) and a continuous metric (e.g., f1), we consider only the binary metric when adding the is_correct column. This might differ from the reported metric in the Llama 3.1 model card.
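The percentage described above can also be computed in code. The following is a minimal sketch, assuming each row is a mapping with a boolean `is_correct` field that may be missing or null for tasks without a binary metric; the helper name `accuracy_from_is_correct` and the toy rows are illustrative only and are not real results.

```python
from typing import Iterable, Mapping, Optional


def accuracy_from_is_correct(rows: Iterable[Mapping]) -> Optional[float]:
    """Fraction of rows with is_correct == True; None when no row has a usable value."""
    # Skip rows where the column is absent or null (assumed possible for some tasks).
    flags = [row["is_correct"] for row in rows if row.get("is_correct") is not None]
    if not flags:
        return None
    return sum(bool(flag) for flag in flags) / len(flags)


# Toy rows standing in for a loaded "__details" subset.
sample_rows = [
    {"is_correct": True},
    {"is_correct": False},
    {"is_correct": True},
    {"is_correct": True},
]
print(accuracy_from_is_correct(sample_rows))  # 0.75
```

Because a dataset loaded with `load_dataset` yields dict-like rows when iterated, the same function should work on a real subset, though that has not been verified against every task.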
Additionally, there is a model metrics subset that contains all the reported metrics, like f1, macro_avg/acc, for all the tasks and subtasks. Please use this subset to find reported metrics in the model card.
Lastly, you can also use the Hugging Face Datasets API to load the dataset. For example, to load an eval task detail, you can use the following code:
```python
from datasets import load_dataset
data = load_dataset(
"meta-llama/Llama-3.1-405B-Instruct-evals",
name="Llama-3.1-405B-Instruct-evals__mbpp__details",
split="latest"
)
```
Please check our [eval reproduction recipe](https://github.com/meta-llama/llama-recipes/tree/b5f64c0b69d7ff85ec186d964c6c557d55025969/tools/benchmarks/llm_eval_harness/meta_eval_reproduce) that demonstrates how to closely reproduce the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and the datasets in [3.1 evals collections](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f) on selected tasks.
Here are the explanations for each column in the task eval details:
**task_type**: Whether the eval task was run as a ‘Generative’ or ‘Choice’ task. A generative task returns the model output, whereas for choice tasks we return the negative log likelihoods of the completions. (The choice task approach is typically used for multiple-choice tasks with non-instruct models.)
**task_name**: Meta's internal eval task name
**subtask_name**: Meta's internal subtask name in cases where the benchmark has subcategories (e.g., MMLU with domains)
**input_question**: The question from the input dataset when available. In cases when that data is overwritten as a part of the evaluation pipeline or it is a complex concatenation of input dataset fields, this will be the serialized prompt object as a string.
**input_choice_list**: In the case of multiple choice questions, this contains a map of the choice name to the text.
**input_final_prompt**: The final input text that is provided to the model for inference. For choice tasks, this will be an array of prompts provided to the model, where we calculate the likelihoods of the different completions in order to get the final answer provided by the model.
**input_correct_responses**: An array of correct responses to the input question.
**output_prediction_text**: The model output for a Generative task
**output_parsed_answer**: Typically the answer we’ve parsed from the model output or calculated using negative log likelihoods. Sometimes this is an interpolation of answers parsed from the model output and fields from the benchmark example.
**output_choice_completions**: For choice tasks, the list of completions we’ve provided to the model to calculate negative log likelihoods
**output_choice_negative_log_likelihoods**: For choice tasks, these are the corresponding negative log likelihoods normalized by different sequence lengths (text, token, raw) for the above completions.
**output_metrics**: Metrics calculated at the example level. Common metrics include:
acc - accuracy
em - exact_match
f1 - F1 score
pass@1 - For coding benchmarks, whether the output code passes tests
**is_correct**: Whether the parsed answer matches the target responses and is considered correct. (Only applicable for benchmarks which have such a boolean metric)
**input_question_hash**: The SHA256 hash of the question text encoded as UTF-8
**input_final_prompts_hash**: An array of SHA256 hashes of the input prompt texts, each encoded as UTF-8
**benchmark_label**: The commonly used benchmark name
**eval_config**: Additional metadata related to the configurations we used to run this evaluation:
num_generations - Generation parameter - how many outputs to generate
num_shots - How many few shot examples to include in the prompt.
max_gen_len - generation parameter (how many tokens to generate)
prompt_fn - The prompt function with jinja template when available
max_prompt_len - Generation parameter. Maximum number of tokens for the prompt. If the input_final_prompt is longer than this configuration, we will truncate it.
return_logprobs - Generation parameter - Whether to return log probabilities when generating output.
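Two of the columns above lend themselves to a quick sanity check. The sketch below recomputes a question hash and picks the completion with the lowest negative log likelihood. The helper names are hypothetical. The hex-digest form of the hash and the flat list-of-floats layout for the likelihoods are assumptions, since this card does not say how either is stored; the actual column may, for instance, hold one list per normalization (text, token, raw).

```python
import hashlib
from typing import Sequence


def question_hash(question: str) -> str:
    """SHA256 of the UTF-8 encoded question text, returned as a hex digest (hex form is assumed)."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()


def pick_choice(completions: Sequence[str], nlls: Sequence[float]) -> str:
    """Return the completion with the lowest negative log likelihood, i.e. the most likely one."""
    if not completions or len(completions) != len(nlls):
        raise ValueError("completions and nlls must be non-empty and the same length")
    best_index = min(range(len(nlls)), key=nlls.__getitem__)
    return completions[best_index]


# Toy values, not taken from the dataset.
print(pick_choice(["A", "B", "C"], [2.3, 0.4, 1.7]))
print(len(question_hash("What is 2 + 2?")))
```

A lower negative log likelihood means a higher probability, so `min` rather than `max` selects the model's preferred choice.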
Provider: maas
Created: 2024-09-26



