
DataDecide-eval-instances

ModelScope community · updated 2025-07-11 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/DataDecide-eval-instances
Description:
# DataDecide evaluation instances

This dataset contains data for individual evaluation instances from the DataDecide project (publication forthcoming). It shows how standard evaluation benchmarks can vary across many dimensions of model design.

The dataset contains evaluations for a range of OLMo-style models trained with:

* 25 different training data configurations
* 9 different sizes with parameter counts 4M, 20M, 60M, 90M, 150M, 300M, 750M, and 1B
* 3 initial random seeds
* Multiple training checkpoints for each model (~10 to ~50 depending on size)
* The 10 different evaluation tasks from [OLMES](https://arxiv.org/abs/2406.08446), using cloze formulation:
  * ARC Challenge, ARC Easy, BoolQ, CSQA, HellaSwag, MMLU (57 subtasks), OBQA, PIQA, Social IQa, Winogrande
* 4 different evaluation methods for ranking model answers

In total there are around 150k model checkpoints and 500M individual evaluation instances.

The cloze formulation (as opposed to the "A/B/C/D" multiple choice format) is used because these models are generally too small to have mastered that format.

The dataset is organized (after untarring) as follows:

```
models/
├── model_name/ # training mix used, e.g., "dclm-baseline"
│   ├── size/ # e.g., "150M"
│   │   ├── seed/ # e.g., "seed-14"
│   │   │   └── step/ # model checkpoint, e.g., "step-25000"
│   │   │       ├── arc_challenge-metrics.json
│   │   │       ├── arc_challenge-predictions.jsonl
│   │   │       ├── ...
```

See the `sample-evals` directory for one example of each task.

The `-metrics.json` file contains the overall metrics for the task, while the `-predictions.jsonl` file contains the predictions for each instance in the format shown below. The metric suffixes correspond to different ways of normalizing the model probabilities when ranking the answer choices (see [OLMES](https://arxiv.org/abs/2406.08446) for details):

* `_raw`: Raw probability
* `_per_token`: log-probability per token
* `_per_char`: log-probability per character
* `_uncond`: probability of answer divided by unconditional probability of answer (no question given)

Here is an example of a prediction line with annotations:

```
{
  "doc_id": 0,  # consecutive instance index
  "native_id": "Mercury_7175875",  # task-specific identifier
  "metrics": {  # Overall metrics
    "predicted_index_raw": 3,  # predicted answer indices
    "predicted_index_per_token": 3,
    "predicted_index_per_char": 3,
    "predicted_index_uncond": 1,
    "correct_choice": 2,  # correct answer index
    "acc_raw": 0,  # accuracies for each method
    "acc_per_token": 0,
    "acc_per_char": 0,
    "acc_uncond": 0},
  "model_output": [  # list of model outputs for each answer choice
    {  # first answer choice
      "sum_logits": -23.55691146850586,  # sum of logprobs of answer tokens
      "num_tokens": 6,  # number of answer tokens
      "num_tokens_all": 201,  # number of tokens in prompt plus answer
      "is_greedy": false,  # whether the answer was the greedy model completion
      "sum_logits_uncond": -34.12132263183594,  # sum of logprobs for unconditional answer tokens
      "logits_per_token": -3.926151911417643,  # normalized logprobs
      "logits_per_char": -0.7138458020759352,
      "logits_per_byte": 1.029861798615096,
      "num_chars": 33  # number of characters in answer choice
    }, ...
  ],
  "label": 2,  # correct answer index
  "task_hash": "da4d61b1b678cfae04369e8a9c4bed3a",  # hash of task configuration
  "model_hash": "596f2b97e34140bf3c9e37fa70e7a5a2"  # hash of model configuration
}
```

In addition, the dataset contains a `summary-metrics.jsonl` file with summary metrics for each task and model configuration.

The `requests` directory contains all the exact model requests used for each instance.
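
The directory layout described above encodes the training mix, model size, seed, and checkpoint step in the path of every `-predictions.jsonl` file. The following is a minimal sketch of how those path components might be recovered while iterating over prediction records. The helper names `parse_eval_path` and `iter_predictions` are hypothetical and not part of the dataset. The sketch assumes that each line of a predictions file is plain JSON (the `#` comments in the annotated example are annotations, not file content) and that step directories are named `step-<integer>`, as in the example. The demo builds a tiny synthetic tree in a temporary directory rather than reading real data.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def parse_eval_path(path: Path) -> dict:
    """Split .../<mix>/<size>/<seed>/<step>/<task>-predictions.jsonl into its parts."""
    mix, size, seed, step = path.parts[-5:-1]
    return {
        "mix": mix,                           # e.g. "dclm-baseline"
        "size": size,                         # e.g. "150M"
        "seed": seed,                         # e.g. "seed-14"
        "step": int(step.split("-", 1)[1]),   # "step-25000" -> 25000 (assumed naming)
        "task": path.name.rsplit("-", 1)[0],  # "arc_challenge-predictions.jsonl" -> "arc_challenge"
    }


def iter_predictions(models_dir: Path) -> Iterator[Tuple[dict, dict]]:
    """Yield (path metadata, prediction record) for every predictions file under models_dir."""
    for path in sorted(models_dir.glob("*/*/*/*/*-predictions.jsonl")):
        meta = parse_eval_path(path)
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield meta, json.loads(line)


# Demo on a synthetic tree; the single record below is a placeholder, not real data.
with tempfile.TemporaryDirectory() as tmp:
    step_dir = Path(tmp) / "models" / "dclm-baseline" / "150M" / "seed-14" / "step-25000"
    step_dir.mkdir(parents=True)
    record = {"doc_id": 0, "label": 2}
    (step_dir / "arc_challenge-predictions.jsonl").write_text(
        json.dumps(record) + "\n", encoding="utf-8"
    )
    rows = list(iter_predictions(Path(tmp) / "models"))

for meta, rec in rows:
    print(meta, rec["doc_id"])
```

With roughly 500M instances in the full dataset, a generator like this keeps memory use flat; filtering on `meta` before parsing lines would avoid reading checkpoints that are not needed.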

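The four ranking methods can be related to the per-answer `model_output` fields with a short sketch. This is an illustration inferred from the annotated example, not the project's evaluation code: the function names `score_choice` and `predicted_indices` are hypothetical, the `uncond` score is computed in log space as `sum_logits - sum_logits_uncond` (which ranks answers the same way as the probability ratio), and the second answer choice below is invented purely so that there is something to rank. The first choice uses the values from the example, where `logits_per_token` and `logits_per_char` equal `sum_logits` divided by `num_tokens` and `num_chars`. The `logits_per_byte` value in the example also appears consistent with a negative log-likelihood in bits per byte, assuming an ASCII answer so that bytes equal characters.

```python
import math


def score_choice(choice: dict) -> dict:
    """Return one score per ranking method for a single answer choice (higher is better)."""
    return {
        "raw": choice["sum_logits"],  # log of the raw answer probability
        "per_token": choice["sum_logits"] / choice["num_tokens"],
        "per_char": choice["sum_logits"] / choice["num_chars"],
        "uncond": choice["sum_logits"] - choice["sum_logits_uncond"],  # log of the ratio
    }


def predicted_indices(model_output: list) -> dict:
    """Pick the highest-scoring answer index under each method (ties go to the first)."""
    scores = [score_choice(choice) for choice in model_output]
    return {
        method: max(range(len(scores)), key=lambda i: scores[i][method])
        for method in ("raw", "per_token", "per_char", "uncond")
    }


# First choice: values copied from the annotated example.
first_choice = {
    "sum_logits": -23.55691146850586,
    "num_tokens": 6,
    "sum_logits_uncond": -34.12132263183594,
    "num_chars": 33,
}
# Second choice: synthetic values, invented for illustration only.
second_choice = {
    "sum_logits": -20.0,
    "num_tokens": 10,
    "sum_logits_uncond": -22.0,
    "num_chars": 50,
}

scores = score_choice(first_choice)
preds = predicted_indices([first_choice, second_choice])

# Assumption: bits per byte, with num_chars standing in for the byte count (ASCII answer).
bits_per_byte = -first_choice["sum_logits"] / (first_choice["num_chars"] * math.log(2))

print(scores)
print(preds)
print(bits_per_byte)
```

In this toy pair the synthetic choice wins under the raw, per-token, and per-character scores, while the unconditional normalization prefers the first choice. The methods can disagree in the same way in real records, as they do in the annotated example where `predicted_index_uncond` differs from the other three.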
Provider:
maas
Created:
2025-05-27