tytodd/qwen3.5-2b-v1
Source: Hugging Face. Updated 2026-04-10. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/tytodd/qwen3.5-2b-v1
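The metadata on this page lists, for every config, the byte size and example count of each split together with a `dataset_size` total. As a rough sanity check, the sketch below parses a short excerpt (the `aes2_essay_scoring` entry, copied from this page in its flattened form) and confirms that the per-split `num_bytes` values add up to the declared `dataset_size`. The names `EXCERPT` and `summarize` are hypothetical helpers introduced only for illustration; they are not part of the dataset or of any library.

```python
import re

# Excerpt copied from this page's metadata (aes2_essay_scoring config).
# Indentation is absent because the page lists the YAML in flattened form.
EXCERPT = """\
- config_name: aes2_essay_scoring
splits:
- name: train
num_bytes: 545057951
num_examples: 10000
- name: val
num_bytes: 55072649
num_examples: 1000
download_size: 454073152
dataset_size: 600130600
"""


def summarize(text):
    """Collect per-split stats and the declared dataset_size from a flattened excerpt.

    Hypothetical helper: it only understands the line shapes shown above.
    """
    splits = {}
    current = None      # name of the split whose fields are being read
    declared = None     # value of the dataset_size line, if present
    in_splits = False   # becomes True once the "splits:" line is seen
    for raw in text.splitlines():
        line = raw.strip()
        if line == "splits:":
            in_splits = True
        elif in_splits and (m := re.fullmatch(r"- name: (\w+)", line)):
            current = m.group(1)
            splits[current] = {}
        elif in_splits and current and (
            m := re.fullmatch(r"(num_bytes|num_examples): (\d+)", line)
        ):
            splits[current][m.group(1)] = int(m.group(2))
        elif m := re.fullmatch(r"dataset_size: (\d+)", line):
            declared = int(m.group(1))
    return splits, declared


if __name__ == "__main__":
    splits, declared = summarize(EXCERPT)
    total = sum(s["num_bytes"] for s in splits.values())
    print(splits)
    print("sum of split bytes:", total, "declared dataset_size:", declared)
```

The same check can be pointed at any other config block on this page by pasting its `splits` section into `EXCERPT`.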
Resource description:
---
dataset_info:
- config_name: aes2_essay_scoring
features:
- name: input
struct:
- name: full_text
dtype: string
- name: prediction
struct:
- name: score
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 545057951
num_examples: 10000
- name: val
num_bytes: 55072649
num_examples: 1000
download_size: 454073152
dataset_size: 600130600
- config_name: arc_challenge
features:
- name: input
struct:
- name: choices
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: choice
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 49254830
num_examples: 1172
download_size: 36059088
dataset_size: 49254830
- config_name: argument_quality_ranking
features:
- name: input
struct:
- name: argument
dtype: string
- name: topic
dtype: string
- name: prediction
struct:
- name: quality_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 110883510
num_examples: 2469
download_size: 78320499
dataset_size: 110883510
- config_name: bbeh
features:
- name: input
struct:
- name: question
dtype: string
- name: task
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 182922405
num_examples: 2120
download_size: 131245828
dataset_size: 182922405
- config_name: bbh_causal_judgement
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 7429515
num_examples: 149
download_size: 5426046
dataset_size: 7429515
- config_name: bbh_disambiguation_qa
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 8743863
num_examples: 200
download_size: 6259765
dataset_size: 8743863
- config_name: bbh_geometric_shapes
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 9649811
num_examples: 200
download_size: 7099846
dataset_size: 9649811
- config_name: bbh_movie_recommendation
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 8778846
num_examples: 200
download_size: 6280040
dataset_size: 8778846
- config_name: bbh_reasoning_about_colored_objects
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 4287578
num_examples: 200
download_size: 3153586
dataset_size: 4287578
- config_name: bbh_ruin_names
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 6746441
num_examples: 200
download_size: 4920687
dataset_size: 6746441
- config_name: bbh_salient_translation_error_detection
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 9526499
num_examples: 200
download_size: 6876220
dataset_size: 9526499
- config_name: bbh_snarks
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 6009091
num_examples: 142
download_size: 4320370
dataset_size: 6009091
- config_name: bbh_sports_understanding
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 6376275
num_examples: 200
download_size: 4570956
dataset_size: 6376275
- config_name: bbh_tracking_shuffled_objects_five_objects
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 5800485
num_examples: 200
download_size: 4192184
dataset_size: 5800485
- config_name: bbh_web_of_lies
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: answer
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 7087308
num_examples: 200
download_size: 5015749
dataset_size: 7087308
- config_name: civil_comments
features:
- name: input
struct:
- name: comment
dtype: string
- name: prediction
struct:
- name: toxicity_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 385418759
num_examples: 10000
- name: val
num_bytes: 37967014
num_examples: 1000
download_size: 301127727
dataset_size: 423385773
- config_name: code_judge_bench
features:
- name: input
struct:
- name: code_A
dtype: string
- name: code_B
dtype: string
- name: problem
dtype: string
- name: prediction
struct:
- name: label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 39778128
num_examples: 344
download_size: 30571424
dataset_size: 39778128
- config_name: colbert_humor_detection
features:
- name: input
struct:
- name: text
dtype: string
- name: prediction
struct:
- name: humor_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 421447123
num_examples: 10000
- name: val
num_bytes: 42620370
num_examples: 1000
download_size: 327254593
dataset_size: 464067493
- config_name: customer_support_tickets_en
features:
- name: input
struct:
- name: body
dtype: string
- name: subject
dtype: string
- name: prediction
struct:
- name: queue
dtype: string
- name: type
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 266959983
num_examples: 5570
- name: val
num_bytes: 49577903
num_examples: 1000
download_size: 230028792
dataset_size: 316537886
- config_name: customer_support_tickets_gorkem
features:
- name: input
struct:
- name: ticket_text
dtype: string
- name: prediction
struct:
- name: ticket_subject
dtype: string
- name: ticket_type
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 343174222
num_examples: 6775
- name: val
num_bytes: 50156096
num_examples: 1000
download_size: 277723949
dataset_size: 393330318
- config_name: go_emotions
features:
- name: input
struct:
- name: text
dtype: string
- name: prediction
struct:
- name: labels
list: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 1974664153
num_examples: 43410
- name: val
num_bytes: 204657118
num_examples: 4500
download_size: 1563304438
dataset_size: 2179321271
- config_name: gpqa_diamond
features:
- name: input
struct:
- name: question
dtype: string
- name: prediction
struct:
- name: choice
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 16706771
num_examples: 198
download_size: 11849726
dataset_size: 16706771
- config_name: halueval_summarization
features:
- name: input
struct:
- name: document
dtype: string
- name: summary
dtype: string
- name: prediction
struct:
- name: hallucination
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 684555651
num_examples: 10000
download_size: 494299736
dataset_size: 684555651
- config_name: hh_rlhf
features:
- name: input
struct:
- name: question
dtype: string
- name: response_A
dtype: string
- name: response_B
dtype: string
- name: prediction
struct:
- name: label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 546155367
num_examples: 10000
- name: val
num_bytes: 53767111
num_examples: 1000
download_size: 431379598
dataset_size: 599922478
- config_name: judge_bench
features:
- name: input
struct:
- name: question
dtype: string
- name: response_A
dtype: string
- name: response_B
dtype: string
- name: prediction
struct:
- name: label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 23341893
num_examples: 280
download_size: 17549850
dataset_size: 23341893
- config_name: lex_glue_case_hold
features:
- name: input
struct:
- name: context
dtype: string
- name: option_a
dtype: string
- name: option_b
dtype: string
- name: option_c
dtype: string
- name: option_d
dtype: string
- name: option_e
dtype: string
- name: prediction
struct:
- name: selected_option
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 504478961
num_examples: 10000
- name: val
num_bytes: 50758127
num_examples: 1000
download_size: 403612465
dataset_size: 555237088
- config_name: lex_glue_scotus
features:
- name: input
struct:
- name: opinion_text
dtype: string
- name: prediction
struct:
- name: issue_id
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 470100934
num_examples: 5000
- name: val
num_bytes: 111627205
num_examples: 1000
download_size: 527829121
dataset_size: 581728139
- config_name: medical_abstracts
features:
- name: input
struct:
- name: medical_abstract
dtype: string
- name: prediction
struct:
- name: condition_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 452164658
num_examples: 10000
- name: val
num_bytes: 45718370
num_examples: 1000
download_size: 357787840
dataset_size: 497883028
- config_name: mfrc
features:
- name: input
struct:
- name: text
dtype: string
- name: prediction
struct:
- name: authority
dtype: bool
- name: care
dtype: bool
- name: fairness
dtype: bool
- name: loyalty
dtype: bool
- name: non_moral
dtype: bool
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 2167216273
num_examples: 55103
- name: val
num_bytes: 214625927
num_examples: 5500
download_size: 2329024906
dataset_size: 2381842200
- config_name: mmlu
features:
- name: input
struct:
- name: choices
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: choice
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 706902361
num_examples: 14042
download_size: 516973828
dataset_size: 706902361
- config_name: mmlu_pro
features:
- name: input
struct:
- name: choices
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: choice
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 672402062
num_examples: 12032
download_size: 488338596
dataset_size: 672402062
- config_name: musr_murder_mysteries
features:
- name: input
struct:
- name: choices
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: choice
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 18986456
num_examples: 250
download_size: 13902082
dataset_size: 18986456
- config_name: musr_object_placements
features:
- name: input
struct:
- name: choices
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: choice
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 15816571
num_examples: 256
download_size: 11818369
dataset_size: 15816571
- config_name: musr_team_allocation
features:
- name: input
struct:
- name: choices
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: choice
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 17521186
num_examples: 250
download_size: 13111884
dataset_size: 17521186
- config_name: or_bench_80k
features:
- name: input
struct:
- name: prompt
dtype: string
- name: prediction
struct:
- name: or_bench_category
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 595824437
num_examples: 10000
- name: val
num_bytes: 58618453
num_examples: 1000
download_size: 462015069
dataset_size: 654442890
- config_name: or_bench_hard_1k
features:
- name: input
struct:
- name: prompt
dtype: string
- name: prediction
struct:
- name: or_bench_category
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 48436940
num_examples: 1055
- name: val
num_bytes: 12252996
num_examples: 264
download_size: 43302033
dataset_size: 60689936
- config_name: or_bench_toxic
features:
- name: input
struct:
- name: prompt
dtype: string
- name: prediction
struct:
- name: or_bench_category
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 23239838
num_examples: 524
download_size: 16510017
dataset_size: 23239838
- config_name: projudgebench
features:
- name: input
struct:
- name: correct_answer
dtype: string
- name: question
dtype: string
- name: step_to_evaluate
dtype: string
- name: steps
list: string
- name: prediction
struct:
- name: correct
dtype: bool
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 189075220
num_examples: 2160
- name: val
num_bytes: 21906441
num_examples: 240
download_size: 153029795
dataset_size: 210981661
- config_name: reward_bench_2
features:
- name: input
struct:
- name: prompt
dtype: string
- name: response_A
dtype: string
- name: response_B
dtype: string
- name: prediction
struct:
- name: label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 79714700
num_examples: 1492
- name: val
num_bytes: 19288626
num_examples: 373
download_size: 73686677
dataset_size: 99003326
- config_name: rod101_essay_scoring
features:
- name: input
struct:
- name: text
dtype: string
- name: prediction
struct:
- name: score
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: ood
num_bytes: 3147005
num_examples: 81
download_size: 3056366
dataset_size: 3147005
- config_name: seekbench
features:
- name: input
struct:
- name: current_trace
dtype: string
- name: previous_traces
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: groundness
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 23323358
num_examples: 446
- name: val
num_bytes: 9318096
num_examples: 184
download_size: 24664275
dataset_size: 32641454
- config_name: seekbench_evidence
features:
- name: input
struct:
- name: current_trace
dtype: string
- name: previous_traces
dtype: string
- name: question
dtype: string
- name: prediction
struct:
- name: clear
dtype: string
- name: sufficient
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 20092595
num_examples: 324
- name: val
num_bytes: 8545000
num_examples: 143
download_size: 21651780
dataset_size: 28637595
- config_name: seekbench_full_trace
features:
- name: input
struct:
- name: final_answer
dtype: string
- name: question
dtype: string
- name: trace
dtype: string
- name: prediction
struct:
- name: correctness
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 7429989
num_examples: 133
- name: val
num_bytes: 3361048
num_examples: 57
download_size: 8420291
dataset_size: 10791037
- config_name: sem_eval_2010_task_8
features:
- name: input
struct:
- name: sentence
dtype: string
- name: prediction
struct:
- name: relation_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 516147603
num_examples: 8000
- name: val
num_bytes: 170187724
num_examples: 2717
download_size: 481020081
dataset_size: 686335327
- config_name: smollm_corpus
features:
- name: input
struct:
- name: text
dtype: string
- name: prediction
struct:
- name: audience
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 568696818
num_examples: 10000
- name: val
num_bytes: 57273807
num_examples: 1000
download_size: 475555394
dataset_size: 625970625
- config_name: snli
features:
- name: input
struct:
- name: hypothesis
dtype: string
- name: premise
dtype: string
- name: prediction
struct:
- name: label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 435804515
num_examples: 10000
- name: val
num_bytes: 43322708
num_examples: 1000
download_size: 340374654
dataset_size: 479127223
- config_name: support_tickets_alpha
features:
- name: input
struct:
- name: description
dtype: string
- name: subject
dtype: string
- name: prediction
struct:
- name: key_phrase
dtype: string
- name: support_class
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 32115815
num_examples: 813
- name: val
num_bytes: 5445773
num_examples: 125
download_size: 26844090
dataset_size: 37561588
- config_name: toxigen_data
features:
- name: input
struct:
- name: text
dtype: string
- name: prediction
struct:
- name: toxicity_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 374413076
num_examples: 8960
- name: val
num_bytes: 39858032
num_examples: 940
download_size: 288758310
dataset_size: 414271108
- config_name: tweet_eval_emotion
features:
- name: input
struct:
- name: tweet
dtype: string
- name: prediction
struct:
- name: emotion_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 126818292
num_examples: 3257
- name: val
num_bytes: 14502622
num_examples: 374
download_size: 99122597
dataset_size: 141320914
- config_name: tweet_eval_hate
features:
- name: input
struct:
- name: tweet
dtype: string
- name: prediction
struct:
- name: hate_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 370046489
num_examples: 8993
- name: val
num_bytes: 41449232
num_examples: 999
download_size: 288215633
dataset_size: 411495721
- config_name: tweet_eval_irony
features:
- name: input
struct:
- name: tweet
dtype: string
- name: prediction
struct:
- name: irony_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 114385831
num_examples: 2862
- name: val
num_bytes: 38167261
num_examples: 955
download_size: 108609615
dataset_size: 152553092
- config_name: tweet_eval_offensive
features:
- name: input
struct:
- name: tweet
dtype: string
- name: prediction
struct:
- name: offensive_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 399364169
num_examples: 10000
- name: val
num_bytes: 40221955
num_examples: 1000
download_size: 306654211
dataset_size: 439586124
- config_name: tweet_eval_sentiment
features:
- name: input
struct:
- name: tweet
dtype: string
- name: prediction
struct:
- name: sentiment_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 402699401
num_examples: 10000
- name: val
num_bytes: 40079108
num_examples: 1000
download_size: 309576448
dataset_size: 442778509
- config_name: tweet_eval_stance_abortion
features:
- name: input
struct:
- name: topic
dtype: string
- name: tweet
dtype: string
- name: prediction
struct:
- name: stance_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 28621932
num_examples: 587
- name: val
num_bytes: 3442542
num_examples: 66
download_size: 22630215
dataset_size: 32064474
- config_name: tweet_eval_stance_atheism
features:
- name: input
struct:
- name: topic
dtype: string
- name: tweet
dtype: string
- name: prediction
struct:
- name: stance_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 22508555
num_examples: 461
- name: val
num_bytes: 2730304
num_examples: 52
download_size: 17937319
dataset_size: 25238859
- config_name: tweet_eval_stance_climate
features:
- name: input
struct:
- name: topic
dtype: string
- name: tweet
dtype: string
- name: prediction
struct:
- name: stance_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 17516196
num_examples: 355
- name: val
num_bytes: 1927658
num_examples: 40
download_size: 13708511
dataset_size: 19443854
- config_name: tweet_eval_stance_feminist
features:
- name: input
struct:
- name: topic
dtype: string
- name: tweet
dtype: string
- name: prediction
struct:
- name: stance_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 28646047
num_examples: 597
- name: val
num_bytes: 3330035
num_examples: 67
download_size: 22715503
dataset_size: 31976082
- config_name: tweet_eval_stance_hillary
features:
- name: input
struct:
- name: topic
dtype: string
- name: tweet
dtype: string
- name: prediction
struct:
- name: stance_label
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 27544670
num_examples: 620
- name: val
num_bytes: 3114492
num_examples: 69
download_size: 21690218
dataset_size: 30659162
- config_name: ultrafeedback
features:
- name: input
struct:
- name: prompt
dtype: string
- name: response
dtype: string
- name: prediction
struct:
- name: instruction_following
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 468892737
num_examples: 10000
- name: val
num_bytes: 47119619
num_examples: 1000
download_size: 372756384
dataset_size: 516012356
- config_name: yelp
features:
- name: input
struct:
- name: text
dtype: string
- name: prediction
struct:
- name: rating
dtype: string
- name: reasoning
dtype: string
- name: messages
struct:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: outputs
struct:
- name: reasoning_content
dtype: string
- name: text
dtype: string
- name: correct
dtype: bool
splits:
- name: train
num_bytes: 417976335
num_examples: 10000
- name: val
num_bytes: 40980000
num_examples: 1000
download_size: 328596831
dataset_size: 458956335
configs:
- config_name: aes2_essay_scoring
data_files:
- split: train
path: aes2_essay_scoring/train-*
- split: val
path: aes2_essay_scoring/val-*
- config_name: arc_challenge
data_files:
- split: ood
path: arc_challenge/ood-*
- config_name: argument_quality_ranking
data_files:
- split: ood
path: argument_quality_ranking/ood-*
- config_name: bbeh
data_files:
- split: ood
path: bbeh/ood-*
- config_name: bbh_causal_judgement
data_files:
- split: ood
path: bbh_causal_judgement/ood-*
- config_name: bbh_disambiguation_qa
data_files:
- split: ood
path: bbh_disambiguation_qa/ood-*
- config_name: bbh_geometric_shapes
data_files:
- split: ood
path: bbh_geometric_shapes/ood-*
- config_name: bbh_movie_recommendation
data_files:
- split: ood
path: bbh_movie_recommendation/ood-*
- config_name: bbh_reasoning_about_colored_objects
data_files:
- split: ood
path: bbh_reasoning_about_colored_objects/ood-*
- config_name: bbh_ruin_names
data_files:
- split: ood
path: bbh_ruin_names/ood-*
- config_name: bbh_salient_translation_error_detection
data_files:
- split: ood
path: bbh_salient_translation_error_detection/ood-*
- config_name: bbh_snarks
data_files:
- split: ood
path: bbh_snarks/ood-*
- config_name: bbh_sports_understanding
data_files:
- split: ood
path: bbh_sports_understanding/ood-*
- config_name: bbh_tracking_shuffled_objects_five_objects
data_files:
- split: ood
path: bbh_tracking_shuffled_objects_five_objects/ood-*
- config_name: bbh_web_of_lies
data_files:
- split: ood
path: bbh_web_of_lies/ood-*
- config_name: civil_comments
data_files:
- split: train
path: civil_comments/train-*
- split: val
path: civil_comments/val-*
- config_name: code_judge_bench
data_files:
- split: ood
path: code_judge_bench/ood-*
- config_name: colbert_humor_detection
data_files:
- split: train
path: colbert_humor_detection/train-*
- split: val
path: colbert_humor_detection/val-*
- config_name: customer_support_tickets_en
data_files:
- split: train
path: customer_support_tickets_en/train-*
- split: val
path: customer_support_tickets_en/val-*
- config_name: customer_support_tickets_gorkem
data_files:
- split: train
path: customer_support_tickets_gorkem/train-*
- split: val
path: customer_support_tickets_gorkem/val-*
- config_name: go_emotions
data_files:
- split: train
path: go_emotions/train-*
- split: val
path: go_emotions/val-*
- config_name: gpqa_diamond
data_files:
- split: ood
path: gpqa_diamond/ood-*
- config_name: halueval_summarization
data_files:
- split: ood
path: halueval_summarization/ood-*
- config_name: hh_rlhf
data_files:
- split: train
path: hh_rlhf/train-*
- split: val
path: hh_rlhf/val-*
- config_name: judge_bench
data_files:
- split: ood
path: judge_bench/ood-*
- config_name: lex_glue_case_hold
data_files:
- split: train
path: lex_glue_case_hold/train-*
- split: val
path: lex_glue_case_hold/val-*
- config_name: lex_glue_scotus
data_files:
- split: train
path: lex_glue_scotus/train-*
- split: val
path: lex_glue_scotus/val-*
- config_name: medical_abstracts
data_files:
- split: train
path: medical_abstracts/train-*
- split: val
path: medical_abstracts/val-*
- config_name: mfrc
data_files:
- split: train
path: mfrc/train-*
- split: val
path: mfrc/val-*
- config_name: mmlu
data_files:
- split: ood
path: mmlu/ood-*
- config_name: mmlu_pro
data_files:
- split: ood
path: mmlu_pro/ood-*
- config_name: musr_murder_mysteries
data_files:
- split: ood
path: musr_murder_mysteries/ood-*
- config_name: musr_object_placements
data_files:
- split: ood
path: musr_object_placements/ood-*
- config_name: musr_team_allocation
data_files:
- split: ood
path: musr_team_allocation/ood-*
- config_name: or_bench_80k
data_files:
- split: train
path: or_bench_80k/train-*
- split: val
path: or_bench_80k/val-*
- config_name: or_bench_hard_1k
data_files:
- split: train
path: or_bench_hard_1k/train-*
- split: val
path: or_bench_hard_1k/val-*
- config_name: or_bench_toxic
data_files:
- split: ood
path: or_bench_toxic/ood-*
- config_name: projudgebench
data_files:
- split: train
path: projudgebench/train-*
- split: val
path: projudgebench/val-*
- config_name: reward_bench_2
data_files:
- split: train
path: reward_bench_2/train-*
- split: val
path: reward_bench_2/val-*
- config_name: rod101_essay_scoring
data_files:
- split: ood
path: rod101_essay_scoring/ood-*
- config_name: seekbench
data_files:
- split: train
path: seekbench/train-*
- split: val
path: seekbench/val-*
- config_name: seekbench_evidence
data_files:
- split: train
path: seekbench_evidence/train-*
- split: val
path: seekbench_evidence/val-*
- config_name: seekbench_full_trace
data_files:
- split: train
path: seekbench_full_trace/train-*
- split: val
path: seekbench_full_trace/val-*
- config_name: sem_eval_2010_task_8
data_files:
- split: train
path: sem_eval_2010_task_8/train-*
- split: val
path: sem_eval_2010_task_8/val-*
- config_name: smollm_corpus
data_files:
- split: train
path: smollm_corpus/train-*
- split: val
path: smollm_corpus/val-*
- config_name: snli
data_files:
- split: train
path: snli/train-*
- split: val
path: snli/val-*
- config_name: support_tickets_alpha
data_files:
- split: train
path: support_tickets_alpha/train-*
- split: val
path: support_tickets_alpha/val-*
- config_name: toxigen_data
data_files:
- split: train
path: toxigen_data/train-*
- split: val
path: toxigen_data/val-*
- config_name: tweet_eval_emotion
data_files:
- split: train
path: tweet_eval_emotion/train-*
- split: val
path: tweet_eval_emotion/val-*
- config_name: tweet_eval_hate
data_files:
- split: train
path: tweet_eval_hate/train-*
- split: val
path: tweet_eval_hate/val-*
- config_name: tweet_eval_irony
data_files:
- split: train
path: tweet_eval_irony/train-*
- split: val
path: tweet_eval_irony/val-*
- config_name: tweet_eval_offensive
data_files:
- split: train
path: tweet_eval_offensive/train-*
- split: val
path: tweet_eval_offensive/val-*
- config_name: tweet_eval_sentiment
data_files:
- split: train
path: tweet_eval_sentiment/train-*
- split: val
path: tweet_eval_sentiment/val-*
- config_name: tweet_eval_stance_abortion
data_files:
- split: train
path: tweet_eval_stance_abortion/train-*
- split: val
path: tweet_eval_stance_abortion/val-*
- config_name: tweet_eval_stance_atheism
data_files:
- split: train
path: tweet_eval_stance_atheism/train-*
- split: val
path: tweet_eval_stance_atheism/val-*
- config_name: tweet_eval_stance_climate
data_files:
- split: train
path: tweet_eval_stance_climate/train-*
- split: val
path: tweet_eval_stance_climate/val-*
- config_name: tweet_eval_stance_feminist
data_files:
- split: train
path: tweet_eval_stance_feminist/train-*
- split: val
path: tweet_eval_stance_feminist/val-*
- config_name: tweet_eval_stance_hillary
data_files:
- split: train
path: tweet_eval_stance_hillary/train-*
- split: val
path: tweet_eval_stance_hillary/val-*
- config_name: ultrafeedback
data_files:
- split: train
path: ultrafeedback/train-*
- split: val
path: ultrafeedback/val-*
- config_name: yelp
data_files:
- split: train
path: yelp/train-*
- split: val
path: yelp/val-*
---
# qwen3.5-2b-v1

- Repo: `tytodd/qwen3.5-2b-v1`
- Config: `/Users/tytodd/Desktop/Modaic/code/core/probe-lab/configs/datasets/v1/v1.yaml`
- Model: `Qwen/Qwen3.5-2B`
- Runtime: `Modal` local vLLM on `localhost`

| benchmark | train | val | ood | all |
| --- | --- | --- | --- | --- |
| customer_support_tickets_gorkem | 1.40% | 1.10% | | 1.36% |
| mfrc | 0.00% | 0.00% | | 0.00% |
| go_emotions | 17.95% | 18.20% | | 17.98% |
| customer_support_tickets_en | 21.20% | 20.20% | | 21.05% |
| aes2_essay_scoring | 20.74% | 18.70% | | 20.55% |
| ultrafeedback | 40.67% | 41.30% | | 40.73% |
| smollm_corpus | 33.05% | 33.00% | | 33.05% |
| or_bench_80k | 26.73% | 44.60% | | 28.35% |
| lex_glue_scotus | 49.44% | 40.80% | | 48.00% |
| medical_abstracts | 63.47% | 63.70% | | 63.49% |
| seekbench_evidence | 60.80% | 69.23% | | 63.38% |
| yelp | 51.50% | 48.90% | | 51.26% |
| tweet_eval_sentiment | 58.48% | 60.70% | | 58.68% |
| hh_rlhf | 51.92% | 54.50% | | 52.15% |
| tweet_eval_stance_hillary | 60.32% | 56.52% | | 59.94% |
| tweet_eval_stance_atheism | 70.72% | 63.46% | | 69.98% |
| reward_bench_2 | 76.07% | 84.18% | | 77.69% |
| sem_eval_2010_task_8 | 37.76% | 38.90% | | 38.05% |
| seekbench | 71.52% | 70.65% | | 71.27% |
| tweet_eval_irony | 54.75% | 55.71% | | 54.99% |
| tweet_eval_offensive | 71.63% | 72.00% | | 71.66% |
| lex_glue_case_hold | 62.25% | 63.20% | | 62.34% |
| seekbench_full_trace | 77.44% | 84.21% | | 79.47% |
| tweet_eval_hate | 68.94% | 62.06% | | 68.25% |
| or_bench_hard_1k | 45.50% | 45.83% | | 45.56% |
| tweet_eval_stance_climate | 61.41% | 65.00% | | 61.77% |
| snli | 77.03% | 80.50% | | 77.35% |
| tweet_eval_stance_feminist | 60.64% | 55.22% | | 60.09% |
| tweet_eval_stance_abortion | 60.48% | 57.58% | | 60.18% |
| toxigen_data | 72.80% | 67.66% | | 72.31% |
| tweet_eval_emotion | 77.19% | 75.94% | | 77.06% |
| civil_comments | 83.96% | 80.00% | | 83.60% |
| projudgebench | 74.58% | 86.67% | | 75.79% |
| colbert_humor_detection | 68.02% | 69.50% | | 68.15% |
| support_tickets_alpha | 90.41% | 85.60% | | 89.77% |
| argument_quality_ranking | | | 50.06% | 50.06% |
| rod101_essay_scoring | | | 29.63% | 29.63% |
| or_bench_toxic | | | 52.29% | 52.29% |
| judge_bench | | | 60.36% | 60.36% |
| musr_team_allocation | | | 68.80% | 68.80% |
| musr_object_placements | | | 47.66% | 47.66% |
| bbh_disambiguation_qa | | | 59.00% | 59.00% |
| bbh_causal_judgement | | | 57.05% | 57.05% |
| musr_murder_mysteries | | | 50.40% | 50.40% |
| halueval_summarization | | | 58.87% | 58.87% |
| bbh_salient_translation_error_detection | | | 68.50% | 68.50% |
| bbh_movie_recommendation | | | 61.50% | 61.50% |
| bbh_sports_understanding | | | 63.00% | 63.00% |
| bbeh | | | 18.63% | 18.63% |
| bbh_geometric_shapes | | | 78.00% | 78.00% |
| code_judge_bench | | | 11.63% | 11.63% |
| mmlu_pro | | | 60.06% | 60.06% |
| bbh_ruin_names | | | 46.50% | 46.50% |
| bbh_snarks | | | 66.20% | 66.20% |
| gpqa_diamond | | | 40.40% | 40.40% |
| bbh_web_of_lies | | | 99.00% | 99.00% |
| mmlu | | | 69.20% | 69.20% |
| bbh_reasoning_about_colored_objects | | | 98.00% | 98.00% |
| bbh_tracking_shuffled_objects_five_objects | | | 99.50% | 99.50% |
| arc_challenge | | | 86.86% | 86.86% |
| all | 37.82% | 38.84% | 60.75% | 40.79% |
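
The `all` column for a single benchmark appears to be the example-weighted pool of its splits rather than a plain average of the percentages. The sketch below checks that reading against the split sizes listed in the metadata (`num_examples`) and the per-split percentages in the table. The function name `pooled_accuracy` is illustrative and not part of this repository, and the weighting rule itself is an assumption inferred from the numbers, not something the card states.

```python
def pooled_accuracy(splits):
    """Pool per-split accuracies into one percentage.

    splits: list of (num_examples, accuracy_percent) pairs.
    Assumption: each reported percentage is a rounded correct/total ratio,
    so the correct count is recovered by rounding before pooling.
    """
    total = sum(n for n, _ in splits)
    correct = sum(round(n * pct / 100) for n, pct in splits)
    return round(100 * correct / total, 2)


# (num_examples, accuracy %) for train and val, taken from the card.
stance_abortion = [(587, 60.48), (66, 57.58)]
yelp = [(10000, 51.50), (1000, 48.90)]

print(pooled_accuracy(stance_abortion))  # 60.18, matching the table
print(pooled_accuracy(yelp))             # 51.26, matching the table
```

Recovering integer counts first matters here: averaging the already-rounded percentages directly can drift by 0.01 from the reported value.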
Provider: tytodd



