Llama-3.2-1B-evals
ModelScope community · Updated 2026-04-28 · Indexed 2024-10-05
Download link:
https://modelscope.cn/datasets/LLM-Research/Llama-3.2-1B-evals
Resource description:
# Dataset Card for Meta Evaluation Result Details for Llama-3.2-1B
This dataset contains the Meta evaluation result details for **Llama-3.2-1B**. It was created from 8 evaluation tasks: needle_in_haystack, mmlu, squad, quac, drop, arc_challenge, multi_needle, agieval_english.
The details for each task are available as a separate subset in each configuration. Each subset is named using the task name plus the timestamp of the upload time and ends with "__details".
For more information about the eval tasks, please refer to this [eval details](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/eval_details.md) page.
You can use the Viewer feature to browse the dataset in a web browser. For most tasks, we provide an "is_correct" column, so you can quickly get our accuracy result for the task by checking the percentage of rows with "is_correct=True". For tasks that have both a binary metric (e.g. exact_match) and a continuous metric (e.g. f1), we only consider the binary metric when adding the is_correct column. The resulting accuracy might therefore differ from the metric reported in the Llama 3.2 model card.
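The "percentage of is_correct=True" calculation can also be done in code. The following is a minimal sketch, not part of the official tooling: the helper name `accuracy_from_is_correct` and the `sample_rows` data are hypothetical, and rows are assumed to behave like dictionaries with an `is_correct` key that may be missing or `None` for benchmarks without a boolean metric.

```python
def accuracy_from_is_correct(rows):
    """Return the fraction of rows whose is_correct flag is True.

    Rows without a usable is_correct value are skipped (assumption: the
    column can be absent or None for benchmarks with no boolean metric).
    Returns None when no row carries the flag.
    """
    flags = [row["is_correct"] for row in rows if row.get("is_correct") is not None]
    if not flags:
        return None
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical rows standing in for a loaded "__details" subset.
sample_rows = [
    {"is_correct": True},
    {"is_correct": False},
    {"is_correct": True},
    {"is_correct": True},
]

print(accuracy_from_is_correct(sample_rows))
```

The same function should work on a subset loaded with the Hugging Face `datasets` library, since iterating over it yields one dictionary per row.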
Additionally, there is a model metrics subset that contains all the reported metrics, like f1, macro_avg/acc, for all the tasks and subtasks. Please use this subset to find reported metrics in the model card.
Lastly, you can also use the Hugging Face Datasets API to load the dataset. For example, to load the details of an eval task, you can use the following code:
```python
from datasets import load_dataset

# Load the details subset for one task; "latest" selects the most recent upload.
data = load_dataset(
    "meta-llama/Llama-3.2-1B-evals",
    name="Llama-3.2-1B-evals__agieval_english__details",
    split="latest",
)
```
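To load the other tasks, you need their subset names. The sketch below builds candidate names by following the pattern of the single example above. This pattern is an assumption inferred from that one name and is not confirmed for every task; the helper `details_config_name` is hypothetical. Before relying on the result, it may be worth comparing it with the list returned by `datasets.get_dataset_config_names`.

```python
# Task names as listed at the top of this card.
TASKS = [
    "needle_in_haystack",
    "mmlu",
    "squad",
    "quac",
    "drop",
    "arc_challenge",
    "multi_needle",
    "agieval_english",
]


def details_config_name(task, model="Llama-3.2-1B"):
    """Build a subset name shaped like the example in this card (assumed pattern)."""
    return f"{model}-evals__{task}__details"


config_names = [details_config_name(task) for task in TASKS]

for name in config_names:
    print(name)
```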
Please check our [eval recipe](https://github.com/meta-llama/llama-recipes/tree/main/tools/benchmarks/llm_eval_harness/meta_eval_reproduce) that demonstrates how to calculate the Llama 3.2 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library on selected tasks.
Here is a detailed explanation of each column in the task eval details:
**task_type**: Whether the eval task was run as a ‘Generative’ or ‘Choice’ task. Generative tasks return the model output, whereas choice tasks return the negative log likelihoods of the completions. (The choice task approach is typically used for multiple-choice tasks with non-instruct models.)
**task_name**: Meta internal eval task name
**subtask_name**: Meta internal subtask name in cases where the benchmark has subcategories (Ex. MMLU with domains)
**input_question**: The question from the input dataset when available. In cases when that data is overwritten as a part of the evaluation pipeline or it is a complex concatenation of input dataset fields, this will be the serialized prompt object as a string.
**input_choice_list**: In the case of multiple choice questions, this contains a map of the choice name to the text.
**input_final_prompt**: The final input text that is provided to the model for inference. For choice tasks, this will be an array of prompts provided to the model, where we calculate the likelihoods of the different completions in order to get the final answer provided by the model.
**input_correct_responses**: An array of correct responses to the input question.
**output_prediction_text**: The model output for a Generative task
**output_parsed_answer**: The answer we’ve parsed from the model output or calculated using negative log likelihoods.
**output_choice_completions**: For choice tasks, the list of completions we’ve provided to the model to calculate negative log likelihoods
**output_choice_negative_log_likelihoods**: For choice tasks, these are the corresponding negative log likelihoods normalized by different sequence lengths (text, token, raw) for the above completions.
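As an illustration of how a choice task can turn likelihoods into an answer, the sketch below picks the completion with the smallest negative log likelihood. It assumes the completions and their scores are parallel lists; the helper `pick_choice` and the numbers are hypothetical, and this card does not state which normalization (text, token, or raw) the evaluation uses for the final answer.

```python
def pick_choice(completions, nlls):
    """Return (index, completion) for the lowest negative log likelihood.

    A lower negative log likelihood means the model assigns the completion
    a higher probability, so the minimum is the preferred answer.
    """
    if not completions or len(completions) != len(nlls):
        raise ValueError("completions and nlls must be non-empty and equal in length")
    best = min(range(len(nlls)), key=lambda i: nlls[i])
    return best, completions[best]


# Hypothetical values for a four-option multiple-choice question.
completions = ["A", "B", "C", "D"]
raw_nlls = [2.31, 0.87, 1.95, 3.10]

print(pick_choice(completions, raw_nlls))
```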
**output_metrics**: Metrics calculated at the example level. Common metrics include:
- acc: accuracy
- em: exact_match
- f1: F1 score
- pass@1: for coding benchmarks, whether the output code passes tests
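Example-level metrics can be aggregated into task-level numbers. The sketch below averages one metric per subtask and then takes the unweighted mean across subtasks (a macro average). It assumes `output_metrics` is a dictionary keyed by metric name, which this card does not confirm; the helper names and sample rows are hypothetical. For the officially reported numbers, the model metrics subset remains the reference.

```python
from collections import defaultdict


def mean_metric_by_subtask(rows, metric):
    """Average one example-level metric within each subtask."""
    values = defaultdict(list)
    for row in rows:
        value = row["output_metrics"].get(metric)
        if value is not None:
            values[row["subtask_name"]].append(value)
    return {name: sum(vals) / len(vals) for name, vals in values.items()}


def macro_average(per_subtask):
    """Unweighted mean across subtasks, regardless of subtask size."""
    return sum(per_subtask.values()) / len(per_subtask)


# Hypothetical rows with an assumed dict-shaped output_metrics column.
sample_rows = [
    {"subtask_name": "a", "output_metrics": {"f1": 1.0}},
    {"subtask_name": "a", "output_metrics": {"f1": 0.5}},
    {"subtask_name": "b", "output_metrics": {"f1": 0.0}},
]

per_subtask = mean_metric_by_subtask(sample_rows, "f1")
print(per_subtask)
print(macro_average(per_subtask))
```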
**is_correct**: Whether the parsed answer matches the target responses and is considered correct. (Only applicable for benchmarks which have such a boolean metric)
**input_question_hash**: The SHA256 hash of the question text encoded as UTF-8
**input_final_prompts_hash**: An array of SHA256 hashes of the input prompt texts, each encoded as UTF-8
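The hashes can be recomputed locally, for instance to match rows against your own copy of a benchmark. The sketch below hashes the UTF-8 bytes of a string with SHA256 and returns the hexadecimal digest. The hex encoding is an assumption, since this card does not say how the digest is stored, and `sha256_utf8` is a hypothetical helper name.

```python
import hashlib


def sha256_utf8(text):
    """SHA256 of the UTF-8 encoding of text, as a lowercase hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# One hash for a question, and a list of hashes for an array of prompts.
question_hash = sha256_utf8("abc")
prompt_hashes = [sha256_utf8(prompt) for prompt in ["abc", ""]]

print(question_hash)
print(prompt_hashes)
```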
**benchmark_label**: The commonly used benchmark name
**eval_config**: Additional metadata related to the configurations we used to run this evaluation:
- num_generations: generation parameter; how many outputs to generate
- num_shots: how many few-shot examples to include in the prompt
- max_gen_len: generation parameter; the maximum number of tokens to generate
- prompt_fn: the prompt function with Jinja template, when available
- max_prompt_len: generation parameter; the maximum number of tokens for the prompt. If the input_final_prompt is longer than this value, we truncate it
- return_logprobs: generation parameter; whether to return log probabilities when generating output
Provider:
maas
Created:
2024-09-26



