llama-4-eval-logs-and-scores

ModelScope community. Updated 2026-01-06; indexed 2025-05-24.
Download link: https://modelscope.cn/datasets/twinkle-ai/llama-4-eval-logs-and-scores
# Dataset Card for llama-4-eval-logs-and-scores

![image/png](https://cdn-uploads.huggingface.co/production/uploads/618dc56cbc345ca7bf95f3cd/li95VdaXTmVRod6ONwhu8.png)

This repository contains the detailed evaluation results of **Llama 4** models, tested using [Twinkle Eval](https://github.com/ai-twinkle/Eval), an AI evaluation tool developed by Twinkle AI. Each entry includes per-question scores across multiple benchmark suites.

## Dataset Details

### Dataset Description

This dataset provides the complete evaluation logs and per-question scores of two Llama 4 models, Scout and Maverick (FP8), tested under a standardized and reproducible setting. All evaluations were conducted using Twinkle Eval, a benchmark framework developed by Twinkle AI.

To improve reliability, the benchmark shuffles the multiple-choice options and repeats each evaluation three times, reporting the average. This repository serves as a transparent and structured archive of how the models perform across different tasks, with every question's result available for analysis and verification.

- **Curated by:** Twinkle AI
- **License:** MIT

### Llama 4 Benchmark Results (Evaluated with Twinkle Eval)

| Model        | TMMLU+ | MMLU  | tw-legal |
|--------------|--------|-------|----------|
| **Scout**    | 67.71  | 82.31 | 47.21    |
| **Maverick** | 78.28  | 87.26 | 61.40    |

> *Maverick was evaluated in the FP8 format.
>
> *All results are based on three runs with randomized options.

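The shuffled-option, three-run protocol described above can be illustrated with a minimal sketch. This is not Twinkle Eval's actual implementation: the question format (a dict with `question`, `options`, and `answer` keys), the `predict` callback, and all function names are assumptions made for illustration only.

```python
import random
from statistics import mean


def shuffle_options(options, answer_index, rng):
    """Shuffle the options and return them with the new index of the correct answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[j] is the original index now sitting at position j.
    return shuffled, order.index(answer_index)


def run_once(questions, predict, rng):
    """Accuracy in percent for one pass, with options reshuffled for every question."""
    correct = 0
    for q in questions:
        shuffled, gold = shuffle_options(q["options"], q["answer"], rng)
        if predict(q["question"], shuffled) == gold:
            correct += 1
    return 100.0 * correct / len(questions)


def evaluate(questions, predict, runs=3, seed=0):
    """Average accuracy over several runs (hypothetical helper, 3 runs by default)."""
    rng = random.Random(seed)
    scores = [run_once(questions, predict, rng) for _ in range(runs)]
    return mean(scores), scores
```

Here `predict(question, options)` stands in for a model call and returns the index of the chosen option. Reshuffling on every run means a model that always picks the same position cannot score consistently well, which is the usual motivation for this kind of setup.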
### Detailed Logs

The full evaluation logs, including per-question results, are available here:

- **Llama-4-Scout-17B-16E-Instruct**: [`results/scout/`](https://huggingface.co/datasets/twinkle-ai/llama-4-eval-logs-and-scores/tree/main/results/scout)
- **Llama-4-Maverick-17B-128E-Instruct-FP8**: [`results/maverick/`](https://huggingface.co/datasets/twinkle-ai/llama-4-eval-logs-and-scores/tree/main/results/maverick)

These files contain the raw evaluation outputs recorded by **Twinkle Eval**, including detailed answers, scores, and metadata for each benchmarked question.

## Citation

```bibtex
@misc{twinkleai2025llama4eval,
  title        = {Llama 4 Evaluation Logs and Scores},
  author       = {Twinkle AI},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/twinkleai/llama-4-eval-logs-and-scores}},
  note         = {Evaluated using Twinkle Eval, a benchmark framework by Twinkle AI}
}
```
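To inspect the logs locally, one option is to fetch only the relevant `results/` subfolder from the Hugging Face repository linked under Detailed Logs. The sketch below assumes the `huggingface_hub` package is installed; the helper names `patterns_for` and `download_logs` are hypothetical, and the layout of the files inside each folder is not specified by this card.

```python
REPO_ID = "twinkle-ai/llama-4-eval-logs-and-scores"

# Folder names taken from the links in the Detailed Logs section.
RESULT_DIRS = {
    "scout": "results/scout",
    "maverick": "results/maverick",
}


def patterns_for(model):
    """Glob patterns that select one model's result folder (hypothetical helper)."""
    return [f"{RESULT_DIRS[model]}/*"]


def download_logs(model, local_dir=None):
    """Download one model's logs and return the local snapshot path.

    Requires network access and the third-party `huggingface_hub` package,
    imported lazily so the helpers above work without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=patterns_for(model),
        local_dir=local_dir,
    )
```

Calling `download_logs("scout")` would then return the path of a local copy containing only the Scout results.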

Provider: maas
Created: 2025-05-20