Llama-3.2-3B-evals

Name: Llama-3.2-3B-evals
Creator: maas
Published: 2026-01-06 16:16:58
License: No description available

ModelScope community. Updated: 2026-01-06. Indexed: 2024-10-05.

Download link:

https://modelscope.cn/datasets/LLM-Research/Llama-3.2-3B-evals

Resource overview:

# Dataset Card for Meta Evaluation Result Details for Llama-3.2-3B

This dataset contains the Meta evaluation result details for **Llama-3.2-3B**. It was created from 8 evaluation tasks: needle_in_haystack, mmlu, squad, quac, drop, arc_challenge, multi_needle, agieval_english.

Each task's details can be found as a specific subset in each configuration, and each subset is named using the task name plus the timestamp of the upload time and ends with "__details".

For more information about the eval tasks, please refer to this [eval details](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/eval_details.md) page.

You can use the Viewer feature to view the dataset in the web browser easily. For most tasks, we provide an "is_correct" column, so you can quickly get our accuracy result for the task by viewing the percentage of "is_correct=True". For tasks that have both a binary metric (e.g. exact_match) and a continuous metric (e.g. f1), we only consider the binary metric when adding the is_correct column. This might differ from the reported metric in the Llama 3.2 model card.

Additionally, there is a model metrics subset that contains all the reported metrics, like f1 and macro_avg/acc, for all the tasks and subtasks. Please use this subset to find the metrics reported in the model card.

Lastly, you can also use the Hugging Face Datasets API to load the dataset. For example, to load an eval task detail, you can use the following code:

```python
from datasets import load_dataset

data = load_dataset(
    "meta-llama/Llama-3.2-3B-evals",
    name="Llama-3.2-3B-evals__agieval_english__details",
    split="latest",
)
```

Please check our [eval recipe](https://github.com/meta-llama/llama-recipes/tree/main/tools/benchmarks/llm_eval_harness/meta_eval_reproduce) that demonstrates how to calculate the Llama 3.2 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library on selected tasks.

Here are detailed explanations for each column of the task eval details:

**task_type**: Whether the eval task was run as a ‘Generative’ or ‘Choice’ task. A generative task returns the model output, whereas for choice tasks we return the negative log likelihoods of the completions. (The choice task approach is typically used for multiple choice tasks for non-instruct models.)

**task_name**: Meta internal eval task name

**subtask_name**: Meta internal subtask name in cases where the benchmark has subcategories (e.g. MMLU with domains)

**input_question**: The question from the input dataset when available. In cases where that data is overwritten as part of the evaluation pipeline, or is a complex concatenation of input dataset fields, this will be the serialized prompt object as a string.

**input_choice_list**: In the case of multiple choice questions, this contains a map of the choice name to the text.

**input_final_prompt**: The final input text that is provided to the model for inference. For choice tasks, this will be an array of prompts provided to the model, where we calculate the likelihoods of the different completions in order to get the final answer provided by the model.

**input_correct_responses**: An array of correct responses to the input question.

**output_prediction_text**: The model output for a Generative task

**output_parsed_answer**: The answer we’ve parsed from the model output or calculated using negative log likelihoods.

**output_choice_completions**: For choice tasks, the list of completions we’ve provided to the model to calculate negative log likelihoods

**output_choice_negative_log_likelihoods**: For choice tasks, these are the corresponding negative log likelihoods normalized by different sequence lengths (text, token, raw) for the above completions.

**output_metrics**: Metrics calculated at the example level. Common metrics include:

- acc: accuracy
- em: exact_match
- f1: F1 score
- pass@1: for coding benchmarks, whether the output code passes tests

**is_correct**: Whether the parsed answer matches the target responses and is considered correct. (Only applicable for benchmarks which have such a boolean metric.)

**input_question_hash**: The SHA256 hash of the question text encoded as UTF-8

**input_final_prompts_hash**: An array of SHA256 hashes of the input prompt text encoded as UTF-8

**benchmark_label**: The commonly used benchmark name

**eval_config**: Additional metadata related to the configurations we used to run this evaluation:

- num_generations: Generation parameter. How many outputs to generate.
- num_shots: How many few-shot examples to include in the prompt.
- max_gen_len: Generation parameter. How many tokens to generate.
- prompt_fn: The prompt function with jinja template when available.
- max_prompt_len: Generation parameter. Maximum number of tokens for the prompt. If the input_final_prompt is longer than this configuration, we will truncate it.
- return_logprobs: Generation parameter. Whether to return log probabilities when generating output.
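The card says a task's accuracy is the percentage of rows with "is_correct=True". The following is a minimal sketch of that calculation on hand-made rows. The helper name `accuracy_from_is_correct` and the sample rows are hypothetical and not part of the dataset or any library. Skipping rows without a boolean `is_correct` value is an assumption, based on the note that the column only applies to benchmarks with a boolean metric.

```python
from typing import Iterable, Mapping, Optional


def accuracy_from_is_correct(rows: Iterable[Mapping[str, object]]) -> Optional[float]:
    """Return the fraction of rows whose is_correct value is True.

    Rows without a boolean is_correct value are skipped (assumption).
    Returns None when no row carries a usable flag.
    """
    flags = [row["is_correct"] for row in rows if isinstance(row.get("is_correct"), bool)]
    if not flags:
        return None
    return sum(flags) / len(flags)


# Hypothetical rows for illustration only, not taken from the dataset.
sample_rows = [
    {"task_name": "demo", "is_correct": True},
    {"task_name": "demo", "is_correct": False},
    {"task_name": "demo", "is_correct": True},
    {"task_name": "demo", "is_correct": True},
]

accuracy = accuracy_from_is_correct(sample_rows)
print(f"accuracy = {accuracy:.2%}")
```

Iterating a Hugging Face `datasets.Dataset` yields one dictionary per row, so the same helper should accept a loaded details subset directly. For tasks whose reported metric is continuous (e.g. f1), the result may differ from the model card, as noted above.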

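For choice tasks, the card describes the parsed answer as calculated from the negative log likelihoods of the candidate completions. A plausible reading, stated here as an assumption, is that the completion with the lowest negative log likelihood is selected. The sketch below illustrates that selection on two parallel lists. The function name `pick_lowest_nll` and the numbers are hypothetical, and the card does not say which normalization (text, token, or raw) decides the final answer.

```python
from typing import Sequence


def pick_lowest_nll(completions: Sequence[str], nlls: Sequence[float]) -> str:
    """Return the completion with the smallest negative log likelihood."""
    if not completions or len(completions) != len(nlls):
        raise ValueError("completions and nlls must be non-empty and the same length")
    # The lowest negative log likelihood corresponds to the highest likelihood.
    best_index = min(range(len(nlls)), key=nlls.__getitem__)
    return completions[best_index]


# Hypothetical values for a four-option question, not taken from the dataset.
completions = ["A", "B", "C", "D"]
token_normalized_nlls = [2.31, 0.47, 1.95, 3.02]

print(pick_lowest_nll(completions, token_normalized_nlls))
```

The actual layout of `output_choice_negative_log_likelihoods` should be checked in the Viewer before adapting this, since the sketch assumes one flat list of values per normalization.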

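The `input_question_hash` and `input_final_prompts_hash` columns are described as SHA256 hashes of text encoded as UTF-8. A minimal sketch of that computation with Python's standard `hashlib` follows. The helper name `sha256_hex` and the sample strings are hypothetical. Whether the dataset stores the digest as a hex string, and whether the text is normalized before hashing, is not stated in the card and is assumed here.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA256 hex digest of text encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical question and prompts, not taken from the dataset.
question = "What is the capital of France?"
prompts = [question + " Answer: A", question + " Answer: B"]

question_hash = sha256_hex(question)             # one hash per question
prompt_hashes = [sha256_hex(p) for p in prompts]  # one hash per prompt

print(question_hash)
print(prompt_hashes)
```

Hashes like these can serve as join keys when comparing the same question across models or runs, provided both sides hash identical text.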
Provider:

maas

Created:

2024-09-26
