llama-4-eval-logs-and-scores
Source: ModelScope community (updated 2026-01-06, indexed 2025-05-24)
Download link: https://modelscope.cn/datasets/twinkle-ai/llama-4-eval-logs-and-scores
# Dataset Card for llama-4-eval-logs-and-scores

This repository contains the detailed evaluation results of **Llama 4** models, tested using [Twinkle Eval](https://github.com/ai-twinkle/Eval), a robust and efficient AI evaluation tool developed by Twinkle AI. Each entry includes per-question scores across multiple benchmark suites.
## Dataset Details
### Dataset Description
This dataset provides the complete evaluation logs and per-question scores of various Llama 4 models, including Scout and Maverick FP8, tested under a standardized and reproducible setting. All evaluations were conducted using Twinkle Eval, a high-precision and efficient benchmark framework developed by Twinkle AI.
To improve reliability, each benchmark is run three times with the multiple-choice options shuffled, and the reported score is the average of the three runs. This repository serves as a transparent, structured archive of model performance across tasks: every question's result is available for analysis and verification.
- **Curated by:** Twinkle AI
- **License:** MIT
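
The shuffled-option, three-run protocol described above can be illustrated with a minimal sketch. This is not Twinkle Eval's actual implementation: the question format, the `predict` callback, the fixed seed, and the use of percentage accuracy as the score are all assumptions made for illustration.

```python
import random
from statistics import mean


def shuffle_options(options, answer_index, rng):
    """Shuffle the options and return them with the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def run_accuracy(questions, predict, rng):
    """Score one run: percentage of questions answered correctly after shuffling."""
    correct = 0
    for q in questions:
        options, gold = shuffle_options(q["options"], q["answer"], rng)
        if predict(q["question"], options) == gold:
            correct += 1
    return 100.0 * correct / len(questions)


def averaged_accuracy(questions, predict, runs=3, seed=0):
    """Average the per-run accuracy over several independently shuffled runs."""
    rng = random.Random(seed)
    return mean(run_accuracy(questions, predict, rng) for _ in range(runs))


# Hypothetical toy data; the real benchmarks are TMMLU+, MMLU and tw-legal.
QUESTIONS = [
    {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Capital of France?", "options": ["Paris", "Rome", "Oslo", "Bern"], "answer": 0},
]
ANSWER_KEY = {q["question"]: q["options"][q["answer"]] for q in QUESTIONS}


def oracle(question, options):
    """Stand-in for a model: always picks the correct option wherever it lands."""
    return options.index(ANSWER_KEY[question])


print(averaged_accuracy(QUESTIONS, oracle))  # prints 100.0
```

Shuffling moves the correct answer to a different position in each run, so a model that favors a fixed option letter is not rewarded for that bias, and averaging over runs smooths out run-to-run variation.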
### Llama 4 Benchmark Results (Evaluated with Twinkle Eval)
| Model | TMMLU+ | MMLU | tw-legal |
|------------|--------|-------|----------|
| **Scout** | 67.71 | 82.31 | 47.21 |
| **Maverick** | 78.28 | 87.26 | 61.40 |
> *Maverick was evaluated in FP8 format.
> *All results are averaged over three runs with shuffled options.
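
To make the comparison between the two models easier to read, the short sketch below computes the per-benchmark gap from the table. The scores are copied from the table above; the helper name `score_gaps` is hypothetical.

```python
# Scores copied from the results table above.
SCORES = {
    "Scout": {"TMMLU+": 67.71, "MMLU": 82.31, "tw-legal": 47.21},
    "Maverick": {"TMMLU+": 78.28, "MMLU": 87.26, "tw-legal": 61.40},
}


def score_gaps(scores, model_a, model_b):
    """Per-benchmark difference (model_a minus model_b), rounded to 2 decimals."""
    return {
        bench: round(scores[model_a][bench] - scores[model_b][bench], 2)
        for bench in scores[model_a]
    }


gaps = score_gaps(SCORES, "Maverick", "Scout")
for bench, gap in gaps.items():
    print(f"{bench}: +{gap:.2f} points")
```

By this calculation Maverick leads Scout by 10.57 points on TMMLU+, 4.95 on MMLU, and 14.19 on tw-legal.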
### Detailed Logs
The full evaluation logs, including per-question results, are available here:
- **Llama-4-Scout-17B-16E-Instruct**: [`results/scout/`](https://huggingface.co/datasets/twinkle-ai/llama-4-eval-logs-and-scores/tree/main/results/scout)
- **Llama-4-Maverick-17B-128E-Instruct-FP8**: [`results/maverick/`](https://huggingface.co/datasets/twinkle-ai/llama-4-eval-logs-and-scores/tree/main/results/maverick)
These files contain the raw evaluation outputs recorded by **Twinkle Eval**, including detailed answers, scores, and metadata for each benchmarked question.
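
One possible way to fetch these logs locally is sketched below, assuming the `huggingface_hub` package is installed. The repository ID and directory names come from the links above; the local directory name and the helper functions are hypothetical. The card does not describe the file format inside `results/`, so the sketch only downloads and lists the files rather than parsing them.

```python
from pathlib import Path

REPO_ID = "twinkle-ai/llama-4-eval-logs-and-scores"
MODEL_DIRS = {"scout": "results/scout", "maverick": "results/maverick"}


def log_patterns(model):
    """Glob pattern selecting one model's log directory inside the repository."""
    return [f"{MODEL_DIRS[model]}/*"]


def download_logs(model, local_dir="llama4-eval-logs"):
    """Download one model's logs and return the sorted local file paths."""
    # Imported lazily so the helpers above work without the dependency installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the logs live in a dataset repo, not a model repo
        allow_patterns=log_patterns(model),
        local_dir=local_dir,
    )
    return sorted(p for p in Path(root, MODEL_DIRS[model]).rglob("*") if p.is_file())
```

Calling `download_logs("scout")` would then fetch only the files under `results/scout/` and leave the Maverick logs untouched.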
## Citation
```bibtex
@misc{twinkleai2025llama4eval,
title = {Llama 4 Evaluation Logs and Scores},
author = {Twinkle AI},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/twinkle-ai/llama-4-eval-logs-and-scores}},
note = {Evaluated using Twinkle Eval, a benchmark framework by Twinkle AI}
}
```
Provider: maas
Created: 2025-05-20



