TAUR-dev/D-EVAL__standard_eval_v3__FinEval_16k_fulleval_3arg_OLMO_RLONLY-RL-countdown_6arg-eval_rl
Source: Hugging Face. Updated: 2025-12-02. Indexed: 2026-02-07.
Download link:
https://hf-mirror.com/datasets/TAUR-dev/D-EVAL__standard_eval_v3__FinEval_16k_fulleval_3arg_OLMO_RLONLY-RL-countdown_6arg-eval_rl
Resource overview:
---
dataset_info:
  config_name: latest
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: task_config
    dtype: string
  - name: task_source
    dtype: string
  - name: prompt
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: model_responses
    list: 'null'
  - name: model_responses__eval_is_correct
    list: 'null'
  - name: all_other_columns
    dtype: string
  - name: original_split
    dtype: string
  - name: metadata
    dtype: string
  - name: model_responses__best_of_n_atags
    list: string
  - name: model_responses__best_of_n_atags__finish_reason_length_flags
    list: bool
  - name: model_responses__best_of_n_atags__length_partial_responses
    list: string
  - name: prompt__best_of_n_atags__metadata
    dtype: string
  - name: model_responses__best_of_n_atags__metadata
    dtype: string
  - name: model_responses__best_of_n_atags__eval_is_correct
    list: bool
  - name: model_responses__best_of_n_atags__eval_extracted_answers
    list: string
  - name: model_responses__best_of_n_atags__eval_extraction_metadata
    dtype: string
  - name: model_responses__best_of_n_atags__eval_evaluation_metadata
    dtype: string
  - name: model_responses__best_of_n_atags__internal_answers__eval_is_correct
    list:
      list: bool
  - name: model_responses__best_of_n_atags__internal_answers__eval_extracted_answers
    list:
      list: string
  - name: model_responses__best_of_n_atags__internal_answers__eval_extraction_metadata
    dtype: string
  - name: model_responses__best_of_n_atags__internal_answers__eval_evaluation_metadata
    dtype: string
  - name: model_responses__best_of_n_atags__metrics
    struct:
    - name: flips_by
      list: int64
    - name: flips_total
      dtype: int64
    - name: num_correct
      dtype: int64
    - name: pass_at_n
      dtype: int64
    - name: percent_correct
      dtype: float64
    - name: total_responses
      dtype: int64
  - name: eval_date
    dtype: string
  splits:
  - name: test
    num_bytes: 90317587
    num_examples: 1000
  download_size: 11966390
  dataset_size: 90317587
configs:
- config_name: latest
  data_files:
  - split: test
    path: latest/test-*
---
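The config and split declared above (config `latest`, split `test`, files under `latest/test-*`) can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, not an official recipe: it assumes the dataset is publicly readable on the Hub and that the repository ID matches the path in the download link. The helper name `load_test_split` is hypothetical.

```python
# Repository ID taken from the download link path; config and split names
# taken from the `configs` section of the card.
REPO_ID = (
    "TAUR-dev/D-EVAL__standard_eval_v3__FinEval_16k_fulleval_3arg_"
    "OLMO_RLONLY-RL-countdown_6arg-eval_rl"
)
CONFIG_NAME = "latest"
SPLIT = "test"


def load_test_split():
    """Return the test split as a `datasets.Dataset` (hypothetical helper).

    Assumes network access and that the repository is public.
    """
    # Imported lazily so the constants above can be inspected even when
    # the `datasets` package is not installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME, split=SPLIT)
```

If the card is accurate, `load_test_split().num_rows` should report 1000, and `column_names` should list the features above.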
Dataset information:
Config name: latest
Features:
- Name: question, data type: string
- Name: answer, data type: string
- Name: task_config, data type: string
- Name: task_source, data type: string
- Name: prompt, list of structs:
  - Sub-feature: content, data type: string
  - Sub-feature: role, data type: string
- Name: model_responses, list type, element type null
- Name: model_responses__eval_is_correct, list type, element type null
- Name: all_other_columns, data type: string
- Name: original_split, data type: string
- Name: metadata, data type: string
- Name: model_responses__best_of_n_atags, list of strings
- Name: model_responses__best_of_n_atags__finish_reason_length_flags, list of booleans
- Name: model_responses__best_of_n_atags__length_partial_responses, list of strings
- Name: prompt__best_of_n_atags__metadata, data type: string
- Name: model_responses__best_of_n_atags__metadata, data type: string
- Name: model_responses__best_of_n_atags__eval_is_correct, list of booleans
- Name: model_responses__best_of_n_atags__eval_extracted_answers, list of strings
- Name: model_responses__best_of_n_atags__eval_extraction_metadata, data type: string
- Name: model_responses__best_of_n_atags__eval_evaluation_metadata, data type: string
- Name: model_responses__best_of_n_atags__internal_answers__eval_is_correct, list of lists of booleans
- Name: model_responses__best_of_n_atags__internal_answers__eval_extracted_answers, list of lists of strings
- Name: model_responses__best_of_n_atags__internal_answers__eval_extraction_metadata, data type: string
- Name: model_responses__best_of_n_atags__internal_answers__eval_evaluation_metadata, data type: string
- Name: model_responses__best_of_n_atags__metrics, struct type:
  - Name: flips_by, list of 64-bit integers (int64)
  - Name: flips_total, data type: 64-bit integer (int64)
  - Name: num_correct, data type: 64-bit integer (int64)
  - Name: pass_at_n, data type: 64-bit integer (int64)
  - Name: percent_correct, data type: 64-bit float (float64)
  - Name: total_responses, data type: 64-bit integer (int64)
- Name: eval_date, data type: string
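The card lists the metric fields but does not say how they are computed. The sketch below shows one plausible way to derive per-row summary values from the boolean list in model_responses__best_of_n_atags__eval_is_correct. It rests on two assumptions: pass_at_n is read as a 0/1 indicator, because it is stored as int64, and the ratio is returned as a fraction in [0, 1], because the scale of the stored percent_correct is not documented. flips_by and flips_total are left out because their definition is not given. The function name `summarize_correctness` is hypothetical.

```python
from typing import List


def summarize_correctness(is_correct: List[bool]) -> dict:
    """Summarize one row's per-response correctness flags.

    Assumption: mirrors, but is not confirmed to equal, the stored
    `model_responses__best_of_n_atags__metrics` struct.
    """
    total = len(is_correct)
    num_correct = sum(1 for flag in is_correct if flag)
    return {
        "total_responses": total,
        "num_correct": num_correct,
        # 1 if at least one of the n responses is correct, else 0.
        "pass_at_n": int(num_correct > 0),
        # Fraction in [0, 1]; an empty list yields 0.0 by convention.
        "fraction_correct": num_correct / total if total else 0.0,
    }
```

Comparing these values against the stored struct on a few rows is a quick way to check whether the assumptions hold for this dataset.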
Dataset splits:
- Split name: test, size: 90317587 bytes, number of examples: 1000
Download size: 11966390 bytes
Total dataset size: 90317587 bytes
Configs:
- Config name: latest, data files:
  - Split: test, path: latest/test-*
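The size figures above support a quick back-of-the-envelope check: about 90 KB per example on average, and a download roughly 7.5 times smaller than the stated dataset size. The sketch below only restates that arithmetic. Reading download_size as the compressed files on the Hub and dataset_size as the loaded size follows the usual Hugging Face convention and is an assumption here. The function name `size_summary` is hypothetical.

```python
# Figures copied from the card.
NUM_BYTES = 90_317_587       # dataset_size / num_bytes of the test split
DOWNLOAD_SIZE = 11_966_390   # download_size
NUM_EXAMPLES = 1_000         # num_examples of the test split


def size_summary(num_bytes: int, download_size: int, num_examples: int) -> dict:
    """Return the average bytes per example and the size ratio."""
    return {
        "avg_bytes_per_example": num_bytes / num_examples,
        "compression_ratio": num_bytes / download_size,
    }


summary = size_summary(NUM_BYTES, DOWNLOAD_SIZE, NUM_EXAMPLES)
print(f"average bytes per example: {summary['avg_bytes_per_example']:.1f}")
print(f"dataset size / download size: {summary['compression_ratio']:.2f}")
```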
Provider:
TAUR-dev



