twinkle-ai/llama-4-eval-logs-and-scores
Source: Hugging Face | Updated: 2025-04-09 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/twinkle-ai/llama-4-eval-logs-and-scores
Description:
This dataset provides the complete evaluation logs and per-question scores for Llama 4 models, including Scout and Maverick FP8, evaluated under a standardized, reproducible setup with Twinkle Eval, a high-precision and efficient evaluation framework developed by Twinkle AI. Multiple-choice options are randomly shuffled, and each score is the average of three repeated runs, which supports reliability analysis. The repository serves as a transparent, structured archive of how the models perform across different tasks, with every question's result available for analysis and verification.
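The shuffled-options, three-run averaging idea described above can be illustrated with a minimal sketch. This is not Twinkle Eval's actual implementation: the function names (`shuffle_options`, `run_trial`, `evaluate`), the question dictionary layout, and the `predict` callback are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean

def shuffle_options(options, answer_index, rng):
    """Shuffle the options; return them with the new index of the correct answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[j] is the original index now sitting at position j
    return shuffled, order.index(answer_index)

def run_trial(questions, predict, rng):
    """Accuracy of `predict` over one pass with freshly shuffled options."""
    correct = 0
    for q in questions:
        shuffled, gold = shuffle_options(q["options"], q["answer"], rng)
        if predict(q["question"], shuffled) == gold:
            correct += 1
    return correct / len(questions)

def evaluate(questions, predict, runs=3, seed=0):
    """Run several shuffled trials and return (per-run scores, mean score)."""
    rng = random.Random(seed)  # fixed seed keeps the shuffles reproducible
    scores = [run_trial(questions, predict, rng) for _ in range(runs)]
    return scores, mean(scores)
```

Here `predict` stands in for a model call that takes the question text and the shuffled option list and returns the index it selects. Averaging over reshuffled runs reduces the influence of any preference a model may have for a particular option position.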
Provided by:
twinkle-ai



