Llama-3.2-1B-Instruct-evals
Source: ModelScope community (魔搭社区). Updated 2026-05-01; indexed 2024-10-05.
Download link:
https://modelscope.cn/datasets/LLM-Research/Llama-3.2-1B-Instruct-evals
Description:
# Dataset Card for Meta Evaluation Result Details for Llama-3.2-1B-Instruct
This dataset contains the Meta evaluation result details for **Llama-3.2-1B-Instruct**. It was created from 21 evaluation tasks: hellaswag_chat, infinite_bench, mmlu_hindi_chat, mmlu_portugese_chat, ifeval__loose, nih__multi_needle, mmlu, gsm8k, mgsm, mmlu_thai_chat, mmlu_spanish_chat, gpqa, bfcl_chat, mmlu_french_chat, ifeval__strict, nexus, math, arc_challenge, openrewrite_chat, mmlu_german_chat, mmlu_italian_chat.
The details for each task are stored as a separate subset in each configuration. Each subset is named using the task name plus the timestamp of the upload time, and ends with "__details".
For more information about the eval tasks, please refer to this [eval details](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/eval_details.md) page.
You can use the Viewer feature to browse the dataset in a web browser. For most tasks, we provide an "is_correct" column, so you can quickly get our accuracy result for the task by checking the percentage of rows with "is_correct=True". For tasks that have both a binary metric (e.g. exact_match) and a continuous metric (e.g. f1), only the binary metric is used to populate the is_correct column. The resulting accuracy might therefore differ from the metric reported in the Llama 3.2 model card.
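The "percentage of is_correct=True" calculation described above can be sketched in a few lines. This is a minimal illustration, not Meta's evaluation code: the helper name `accuracy_from_is_correct` and the sample rows are hypothetical, and rows without a boolean `is_correct` value are skipped on the assumption that the column does not apply to them.

```python
from typing import Iterable, Mapping, Optional


def accuracy_from_is_correct(rows: Iterable[Mapping[str, object]]) -> Optional[float]:
    """Return the fraction of rows whose is_correct value is True.

    Rows with a missing or non-boolean is_correct value are ignored.
    Returns None when no row carries a usable value.
    """
    flags = [row["is_correct"] for row in rows if isinstance(row.get("is_correct"), bool)]
    if not flags:
        return None
    return sum(flags) / len(flags)


# Hypothetical rows for illustration only; real rows come from a "__details" subset.
sample_rows = [
    {"is_correct": True},
    {"is_correct": False},
    {"is_correct": True},
    {"is_correct": True},
]

print(accuracy_from_is_correct(sample_rows))  # 0.75
```

The same function should work on any iterable of row dictionaries, which is what iterating over a loaded split typically yields.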
Additionally, there is a model metrics subset that contains all the reported metrics, like f1, macro_avg/acc, for all the tasks and subtasks. Please use this subset to find reported metrics in the model card.
Lastly, you can also use the Hugging Face Datasets API to load the dataset. For example, to load the details of an eval task, you can use the following code:
```python
from datasets import load_dataset

# Load the per-example details for the mmlu_italian_chat task.
data = load_dataset(
    "meta-llama/Llama-3.2-1B-Instruct-evals",
    name="Llama-3.2-1B-Instruct-evals__mmlu_italian_chat__details",
    split="latest",
)
```
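To load a different task, only the `name` argument changes. The sketch below builds that name from a task name, assuming every subset follows the same pattern as the single example above (dataset name, task name, then "__details"); this pattern is inferred from that one example and is not confirmed for all 21 tasks. The helper names `details_config_name` and `load_task_details` are hypothetical.

```python
DATASET_REPO = "meta-llama/Llama-3.2-1B-Instruct-evals"
DATASET_NAME = "Llama-3.2-1B-Instruct-evals"


def details_config_name(task_name: str, dataset_name: str = DATASET_NAME) -> str:
    """Build a subset name following the pattern of the mmlu_italian_chat example (assumed)."""
    return f"{dataset_name}__{task_name}__details"


def load_task_details(task_name: str):
    """Load the "latest" split of one task's details subset."""
    # Imported lazily so the naming helper can be used without the datasets package.
    from datasets import load_dataset

    return load_dataset(DATASET_REPO, name=details_config_name(task_name), split="latest")


print(details_config_name("gsm8k"))  # Llama-3.2-1B-Instruct-evals__gsm8k__details
```

If a generated name does not match an existing subset, check the subset list in the Viewer rather than relying on the pattern.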
Please check our [eval recipe](https://github.com/meta-llama/llama-recipes/tree/main/tools/benchmarks/llm_eval_harness/meta_eval_reproduce) that demonstrates how to calculate the Llama 3.2 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library on selected tasks.
Here is a detailed explanation of each column in the task eval details:
**task_type**: Whether the eval task was run as a ‘Generative’ or ‘Choice’ task. Generative tasks return the model output, whereas choice tasks return the negative log likelihoods of the completions. (The choice task approach is typically used for multiple choice tasks with non-instruct models.)
**task_name**: Meta internal eval task name
**subtask_name**: Meta internal subtask name in cases where the benchmark has subcategories (e.g. MMLU with domains)
**input_question**: The question from the input dataset when available. In cases when that data is overwritten as a part of the evaluation pipeline or it is a complex concatenation of input dataset fields, this will be the serialized prompt object as a string.
**input_choice_list**: In the case of multiple choice questions, this contains a map of the choice name to the text.
**input_final_prompt**: The final input text that is provided to the model for inference. For choice tasks, this will be an array of prompts provided to the model, where we calculate the likelihoods of the different completions in order to get the final answer provided by the model.
**input_correct_responses**: An array of correct responses to the input question.
**output_prediction_text**: The model output for a Generative task
**output_parsed_answer**: The answer we’ve parsed from the model output or calculated using negative log likelihoods.
**output_choice_completions**: For choice tasks, the list of completions we’ve provided to the model to calculate negative log likelihoods
**output_choice_negative_log_likelihoods**: For choice tasks, these are the corresponding negative log likelihoods normalized by different sequence lengths (text, token, raw) for the above completions.
**output_metrics**: Metrics calculated at the example level. Common metrics include:
acc - accuracy
em - exact_match
f1 - F1 score
pass@1 - For coding benchmarks, whether the output code passes tests
**is_correct**: Whether the parsed answer matches the target responses and is considered correct. (Only applicable to benchmarks that have such a boolean metric.)
**input_question_hash**: The SHA256 hash of the question text encoded as UTF-8
**input_final_prompts_hash**: An array of SHA256 hashes of the input prompt texts, each encoded as UTF-8
**benchmark_label**: The commonly used benchmark name
**eval_config**: Additional metadata related to the configurations we used to run this evaluation
num_generations - Generation parameter: how many outputs to generate.
num_shots - How many few-shot examples to include in the prompt.
max_gen_len - Generation parameter: how many tokens to generate.
prompt_fn - The prompt function, with its Jinja template when available.
max_prompt_len - Generation parameter: the maximum number of tokens for the prompt. If the input_final_prompt is longer than this, it is truncated.
return_logprobs - Generation parameter: whether to return log probabilities when generating output.
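Two of the columns above describe small computations that can be sketched directly: the hash columns (SHA256 of the UTF-8 encoded text) and the answer selection for choice tasks (the completion with the highest likelihood, that is, the lowest negative log likelihood). The following is an illustrative sketch only. The helper names, the hex-digest output format, and the sample values are assumptions, and which normalization (text, token, or raw) is used to pick the final answer is not stated in this card.

```python
import hashlib
from typing import Sequence


def sha256_utf8(text: str) -> str:
    """SHA256 of the text encoded as UTF-8, returned as a hex string (format assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pick_choice(completions: Sequence[str], nlls: Sequence[float]) -> str:
    """Return the completion with the smallest negative log likelihood."""
    if not completions or len(completions) != len(nlls):
        raise ValueError("completions and nlls must be non-empty and of equal length")
    best_index = min(range(len(nlls)), key=lambda i: nlls[i])
    return completions[best_index]


# Hypothetical values for illustration; real ones come from
# output_choice_completions and output_choice_negative_log_likelihoods.
completions = ["A", "B", "C", "D"]
nlls = [2.31, 0.47, 1.90, 3.05]

print(pick_choice(completions, nlls))  # B
print(len(sha256_utf8("What is 2 + 2?")))  # 64
```

Comparing `sha256_utf8(row["input_question"])` against `input_question_hash` is one way to check whether the hex-digest assumption holds for a given subset.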
Provider: maas
Created: 2024-09-26



