taresco/details_meta-llama__Llama-3.1-8B
Source: Hugging Face. Updated 2025-04-10. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/taresco/details_meta-llama__Llama-3.1-8B
Resource description:
Dataset automatically created during the evaluation run of model [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B).
The dataset is composed of 27 configurations, each one corresponding to one of the evaluated tasks.
The dataset has been created from 31 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.
An additional configuration "results" stores all the aggregated results of the runs.
To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset

data = load_dataset("taresco/details_meta-llama__Llama-3.1-8B", "results", split="train")
```
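Since each run is stored as a timestamp-named split inside each configuration, it can help to list what a repository actually contains before loading anything. The following is a minimal sketch, not part of the original card: the helper names `list_configs`, `list_run_splits`, and `timestamped_splits` are hypothetical, and the sorting step assumes that split names are zero-padded timestamps, so that alphabetical order matches chronological order. It relies on `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library.

```python
# Alias splits that point at the most recent run rather than naming a run.
ALIAS_SPLITS = {"train", "latest"}


def list_configs(repo_id):
    """Return the configuration names published in the repo (needs network access)."""
    # Imported lazily so that defining the helpers does not require the library.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(repo_id)


def list_run_splits(repo_id, config_name):
    """Return every split name of one configuration (needs network access)."""
    from datasets import get_dataset_split_names

    return get_dataset_split_names(repo_id, config_name)


def timestamped_splits(split_names):
    """Drop the alias splits and sort the remaining names.

    Assumption: the remaining names are zero-padded timestamps, so a plain
    string sort puts the oldest run first and the newest run last.
    """
    return sorted(name for name in split_names if name not in ALIAS_SPLITS)
```

For example, `timestamped_splits(list_run_splits("taresco/details_meta-llama__Llama-3.1-8B", "results"))` would return the run splits of the "results" configuration, oldest first, and any of those names could then be passed as `split=` to `load_dataset`.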
## Latest results
These are the [latest results from run 2025-04-10T10:05:40.552314](https://huggingface.co/datasets/taresco/details_meta-llama__Llama-3.1-8B/blob/main/results_2025-04-10T10-05-40.552314.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the "results" configuration and in the "latest" split for each eval):
```python
{
    "all": {
        "judge_score_gpt-4o": 0.008,
        "judge_score_gpt-4o_stderr": 0.0056454836766901715
    },
    "community|afrimathevals:afrimgsm_wol|0": {
        "judge_score_gpt-4o": 0.008,
        "judge_score_gpt-4o_stderr": 0.0056454836766901715
    }
}
```
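The score and its standard error can be combined into a rough uncertainty range. The sketch below is illustrative and not part of the original card: it assumes `judge_score_gpt-4o` is a proportion between 0 and 1, uses a normal approximation with z = 1.96 (about 95%), and clamps the range to [0, 1]. The function name `normal_interval` is hypothetical. With a score this close to zero the normal approximation is crude, so the range should be read as indicative only.

```python
# Results copied from the latest run reported above.
results = {
    "all": {
        "judge_score_gpt-4o": 0.008,
        "judge_score_gpt-4o_stderr": 0.0056454836766901715,
    },
    "community|afrimathevals:afrimgsm_wol|0": {
        "judge_score_gpt-4o": 0.008,
        "judge_score_gpt-4o_stderr": 0.0056454836766901715,
    },
}


def normal_interval(score, stderr, z=1.96):
    """Return (low, high) = score -/+ z * stderr, clamped to [0, 1].

    Assumption: the score is a proportion, so values outside [0, 1] are
    not meaningful and are clipped.
    """
    half_width = z * stderr
    return max(0.0, score - half_width), min(1.0, score + half_width)


for task, metrics in results.items():
    score = metrics["judge_score_gpt-4o"]
    low, high = normal_interval(score, metrics["judge_score_gpt-4o_stderr"])
    print(f"{task}: score={score:.3f}, approx. 95% range=({low:.4f}, {high:.4f})")
```

The lower bound is clipped to zero here because the unclamped value would be negative, which is one sign that the approximation is being stretched at such a low score.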
Provider:
taresco



