llama-4-eval-logs-and-scores

Name: llama-4-eval-logs-and-scores
Creator: maas
Published: 2026-01-06 16:33:01
License: No description available

ModelScope community · Updated 2026-01-06 · Indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/twinkle-ai/llama-4-eval-logs-and-scores

Resource description:

# Dataset Card for llama-4-eval-logs-and-scores

![image/png](https://cdn-uploads.huggingface.co/production/uploads/618dc56cbc345ca7bf95f3cd/li95VdaXTmVRod6ONwhu8.png)

This repository contains the detailed evaluation results of **Llama 4** models, tested using [Twinkle Eval](https://github.com/ai-twinkle/Eval), a robust and efficient AI evaluation tool developed by Twinkle AI. Each entry includes per-question scores across multiple benchmark suites.

## Dataset Details

### Dataset Description

This dataset provides the complete evaluation logs and per-question scores of various Llama 4 models, including Scout and Maverick FP8, tested under a standardized and reproducible setting. All evaluations were conducted using Twinkle Eval, a high-precision and efficient benchmark framework developed by Twinkle AI. To improve reliability, the answer options of each multiple-choice question are shuffled and every benchmark is run three times, with the reported score being the average of the three runs. This repository serves as a transparent and structured archive of how the models perform across different tasks, with every question's result available for analysis and verification.

- **Curated by:** Twinkle AI
- **License:** MIT

### Llama 4 Benchmark Results (Evaluated with Twinkle Eval)

| Model        | TMMLU+ | MMLU  | tw-legal |
|--------------|--------|-------|----------|
| **Scout**    | 67.71  | 82.31 | 47.21    |
| **Maverick** | 78.28  | 87.26 | 61.40    |

> *Maverick was evaluated in the FP8 format.
> *All results are based on three runs with randomized options.
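The protocol described above (shuffled answer options, three runs, averaged score) can be illustrated with a minimal Python sketch. This is not Twinkle Eval's code or API. The function names (`shuffle_options`, `run_once`, `evaluate`), the question dictionary layout, and the toy predictors are all hypothetical and only show the general idea.

```python
import random
from statistics import mean

def shuffle_options(options, answer_index, rng):
    """Return the options in random order plus the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)

def run_once(questions, predict, rng):
    """Accuracy in percent for one pass, with options reshuffled per question."""
    correct = 0
    for q in questions:
        opts, gold = shuffle_options(q["options"], q["answer"], rng)
        if predict(q["question"], opts) == gold:
            correct += 1
    return 100.0 * correct / len(questions)

def evaluate(questions, predict, runs=3, seed=0):
    """Average accuracy over `runs` passes; also returns the per-run scores."""
    rng = random.Random(seed)
    scores = [run_once(questions, predict, rng) for _ in range(runs)]
    return mean(scores), scores

# Toy data (hypothetical schema): "answer" is the index of the correct option.
QUESTIONS = [
    {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Capital of France?", "options": ["Paris", "Rome", "Oslo", "Bern"], "answer": 0},
]
KEY = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}

def oracle(question, options):
    """Picks the correct text wherever it lands, so shuffling does not affect it."""
    return options.index(KEY[question])

def always_first(question, options):
    """Position-biased baseline: always answers with the first option."""
    return 0

if __name__ == "__main__":
    print(evaluate(QUESTIONS, oracle))        # content-based predictor
    print(evaluate(QUESTIONS, always_first))  # position-based predictor
```

Shuffling matters because a predictor that keys on option position rather than content (like `always_first`) gets a score that fluctuates between runs, while averaging several runs smooths that fluctuation out.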
### Detailed Logs

The full evaluation logs, including per-question results, are available here:

- **Llama-4-Scout-17B-16E-Instruct**: [`results/scout/`](https://huggingface.co/datasets/twinkle-ai/llama-4-eval-logs-and-scores/tree/main/results/scout)
- **Llama-4-Maverick-17B-128E-Instruct-FP8**: [`results/maverick/`](https://huggingface.co/datasets/twinkle-ai/llama-4-eval-logs-and-scores/tree/main/results/maverick)

These files contain the raw evaluation outputs recorded by **Twinkle Eval**, including detailed answers, scores, and metadata for each benchmarked question.

## Citation

```bibtex
@misc{twinkleai2025llama4eval,
  title = {Llama 4 Evaluation Logs and Scores},
  author = {Twinkle AI},
  year = {2025},
  howpublished = {\url{https://huggingface.co/datasets/twinkleai/llama-4-eval-logs-and-scores}},
  note = {Evaluated using Twinkle Eval, a benchmark framework by Twinkle AI}
}
```

Provider: maas

Created: 2025-05-20
