sghosts/tubitak-olimpiyat-dataset
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sghosts/tubitak-olimpiyat-dataset
Resource description:
---
language:
- tr
task_categories:
- question-answering
- multiple-choice
- visual-question-answering
- text-generation
pretty_name: TUBITAK Science Olympiad Dataset
size_categories:
- 1K<n<10K
license: cc-by-4.0
dataset_info:
  features:
  - name: id
    dtype: string
  - name: subject
    dtype: string
  - name: year
    dtype: int64
  - name: stage
    dtype: int64
  - name: question_number
    dtype: int64
  - name: question_image
    dtype: image
  - name: solution_image
    dtype: image
  - name: question_latex
    dtype: string
  - name: solution_latex
    dtype: string
  - name: has_solution
    dtype: bool
  - name: has_figure
    dtype: bool
  - name: has_choices
    dtype: bool
  - name: choice_values
    dtype: string
  - name: has_answer
    dtype: bool
  - name: answer_letter
    dtype: string
  - name: answer_value
    dtype: string
  splits:
  - name: bilgisayar
    num_bytes: 178222370.0
    num_examples: 863
  - name: fizik
    num_bytes: 106735399.0
    num_examples: 332
  - name: matematik
    num_bytes: 129297926.0
    num_examples: 671
  - name: ortaokul_bilgisayar
    num_bytes: 38764778.0
    num_examples: 233
  - name: ortaokul_matematik
    num_bytes: 87348055.0
    num_examples: 599
  download_size: 528875575
  dataset_size: 540368528.0
configs:
- config_name: default
  data_files:
  - split: bilgisayar
    path: data/bilgisayar-*
  - split: fizik
    path: data/fizik-*
  - split: matematik
    path: data/matematik-*
  - split: ortaokul_bilgisayar
    path: data/ortaokul_bilgisayar-*
  - split: ortaokul_matematik
    path: data/ortaokul_matematik-*
---
# TUBITAK Science Olympiad Dataset
This dataset contains multiple-choice and open-ended scientific questions sourced from the TUBITAK (The Scientific and Technological Research Council of Turkey) Science Olympiads spanning various years. It is intended to serve as a benchmark for evaluating the advanced analytical, mathematical, and computational reasoning capabilities of Large Language Models (LLMs) in the Turkish language.
The dataset comprises approximately 2,700 problems (2,698 by the split counts in the metadata) across five domains: Computer Science, Physics, Mathematics, Middle School Computer Science, and Middle School Mathematics. The raw problems have been formatted, OCR-processed (using `deepseek-ai/DeepSeek-OCR-2`), and augmented with structural rules to test multi-step reasoning.
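As a quick orientation, the sketch below records the five split names and example counts from the card metadata and wraps loading in a small helper. It assumes the Hugging Face `datasets` library is installed and that the repository ID `sghosts/tubitak-olimpiyat-dataset` (taken from the page title) resolves on the Hub; `load_split` and `total_examples` are hypothetical helper names, not part of the dataset.

```python
# Split names and example counts copied from the dataset card metadata.
SPLIT_SIZES = {
    "bilgisayar": 863,           # Computer Science
    "fizik": 332,                # Physics
    "matematik": 671,            # Mathematics
    "ortaokul_bilgisayar": 233,  # Middle School Computer Science
    "ortaokul_matematik": 599,   # Middle School Mathematics
}

# Assumption: repository ID as shown in the page title.
REPO_ID = "sghosts/tubitak-olimpiyat-dataset"


def total_examples(split_sizes=SPLIT_SIZES):
    """Sum the per-split example counts."""
    return sum(split_sizes.values())


def load_split(name):
    """Load one split from the Hub (hypothetical helper; needs network access)."""
    if name not in SPLIT_SIZES:
        raise ValueError(f"unknown split: {name!r}")
    # Imported lazily so the rest of this sketch runs without `datasets`.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=name)
```

Calling `total_examples()` gives 2698, which is where the "approximately 2,700" figure comes from; `load_split("fizik")` would fetch the Physics split.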
## Dataset Structure
Each entry in the dataset represents a specific problem from the competition stages (typically Stage 1).
- **id**: Unique identifier of the problem (e.g., Matematik_2024_1.Asama_1).
- **subject**: Science domain (Matematik, Fizik, Bilgisayar, Ortaokul Matematik, etc.).
- **year**: The year of the examination.
- **stage**: Examination stage (1 or 2). Note: Computer Science and Physics contain only Stage 1 questions.
- **question_number**: The specific problem number within the exam booklet.
- **question_image**: The primary image associated with the question.
- **solution_image**: The solution image (if any).
- **question_latex**: The textual representation of the problem (includes LaTeX formulations where applicable).
- **solution_latex**: LaTeX formatted solution text (if any).
- **has_solution**: Indicates whether the problem has a solution.
- **has_figure**: Boolean flag indicating whether the problem relies on visual context (the flag is not guaranteed to be correct for every problem).
- **has_choices**: Indicates whether the question is multiple-choice or open-ended.
- **choice_values**: The multiple-choice options (A, B, C, D, E), stored as a string.
- **has_answer**: Indicates whether the problem has an answer.
- **answer_letter**: The correct choice letter.
- **answer_value**: The actual content of the correct choice.
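The `id` example above suggests the pattern `<subject>_<year>_<stage>.Asama_<question_number>`. The sketch below parses an identifier on that assumption and checks it against the `year`, `stage`, and `question_number` fields of the same record. The pattern is inferred from the single example `Matematik_2024_1.Asama_1`, so identifiers that deviate from it (for instance, ones carrying a booklet letter) simply return `None`. `parse_problem_id` and `id_matches_fields` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the example "Matematik_2024_1.Asama_1".
ID_PATTERN = re.compile(
    r"^(?P<subject>.+)_(?P<year>\d{4})_(?P<stage>\d+)\.Asama_(?P<number>\d+)$"
)


class ProblemId(NamedTuple):
    subject: str
    year: int
    stage: int
    question_number: int


def parse_problem_id(problem_id: str) -> Optional[ProblemId]:
    """Parse an id string; return None if it does not follow the assumed pattern."""
    match = ID_PATTERN.match(problem_id)
    if match is None:
        return None
    return ProblemId(
        subject=match["subject"],
        year=int(match["year"]),
        stage=int(match["stage"]),
        question_number=int(match["number"]),
    )


def id_matches_fields(record: dict) -> bool:
    """Check that the parsed id agrees with the record's own metadata fields."""
    parsed = parse_problem_id(record["id"])
    if parsed is None:
        return False
    return (parsed.year, parsed.stage, parsed.question_number) == (
        record["year"],
        record["stage"],
        record["question_number"],
    )
```

A check like this can flag records whose identifier and metadata disagree before they are used in an evaluation.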
## Important Characteristics & Limitations
- **Visual Context:** Visuals within questions are marked as [IMAGE]. For problems that share a common block of text or context, the explanatory text/image is placed at the top of each such problem's question image. The shared context typically ends with `\n---\n`.
- **Cancellations:** Most cancelled questions from the official exams were skipped; however, recoverable ones were preserved (e.g., Middle School Computer-2020-Stage1-Booklet A-8 and 9 were kept, whereas Computer-2014-Stage1-28, 30, and 31 were skipped).
- **Reference Links:** Solutions to questions that strictly reference the previous problem have been largely modified to be standalone, but perfection is not guaranteed (see Computer-2020-Stage1-21 and 23).
- **Code Excerpts:** In the Computer Science branches, the last 10-15 questions are typically C programming tasks formatted heavily in LaTeX. While recent years (e.g., 2025) might have these converted directly to images, older ones (e.g., 2024) do not always have a briefing image. Furthermore, any raw C code present in questions is wrapped in standard Markdown `c` code blocks for clarity.
- **AI Intervention:** Please note that artificial intelligence (specifically OCR models) was used to create and structure this dataset, so complex LaTeX may be transcribed inaccurately.
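The conventions above (the `\n---\n` context separator, the `[IMAGE]` marker, and C code in Markdown code blocks) can be handled with a few string helpers. The sketch below is illustrative only: it assumes the separator also appears in the `question_latex` text and that its first occurrence ends the shared context, neither of which the card states explicitly. All function names are hypothetical.

```python
import re

CONTEXT_SEPARATOR = "\n---\n"
IMAGE_MARKER = "[IMAGE]"
# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block whose info string is exactly "c".
C_BLOCK_PATTERN = re.compile(
    re.escape(FENCE) + r"c[ \t]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def split_shared_context(question_latex):
    """Return (context, question); context is '' when no separator is present."""
    context, separator, question = question_latex.partition(CONTEXT_SEPARATOR)
    if not separator:
        return "", question_latex
    return context, question


def count_image_markers(text):
    """Count [IMAGE] placeholders in a question or solution string."""
    return text.count(IMAGE_MARKER)


def extract_c_blocks(text):
    """Return the bodies of all C code blocks found in the text."""
    return [block.strip("\n") for block in C_BLOCK_PATTERN.findall(text)]
```

Splitting on the first separator keeps any later `---` lines inside the question itself; whether that matches every record would need to be checked against the data.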
## Usage
This dataset is particularly useful for:
- **Benchmarking:** Testing LLMs on demanding, multi-step scientific reasoning tasks in non-English contexts.
- **Multimodal Evaluation:** Comparing model performance on highly visual problems (such as the Physics branch) with text-only analytical performance.
- **Chain-of-Thought (CoT) Capabilities:** Eliciting formal proofs and deep understanding in mathematics, kinematics, and logic/code tracing.
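For the multimodal use case, one possible first step is to separate problems flagged as figure-dependent from the rest. The sketch below does this with the `has_figure` field; `partition_by_figure` is a hypothetical helper, and because the flag is not guaranteed to be accurate, the resulting groups are approximate.

```python
def partition_by_figure(records):
    """Split records into (text_only, with_figure) lists using has_figure."""
    text_only, with_figure = [], []
    for record in records:
        # Missing or falsy flags are treated as text-only (an assumption).
        if record.get("has_figure"):
            with_figure.append(record)
        else:
            text_only.append(record)
    return text_only, with_figure
```

The text-only group could then be evaluated from `question_latex` alone, while the figure group would also need `question_image`.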
## LLM Performance Evaluation / Benchmark
The most recent two years of Stage 1 questions for all active branches were evaluated using a strict single-prompt, Chain-of-Thought approach. Models were instructed to reason step by step and to output only the final choice letter.
*(Cancelled ("IPTAL") problems were excluded from accuracy calculations, so Accuracy = Pass / (Pass + Fail).)*
| Model | Total Pass | Total Fail | Cancelled (Ignored) | Accuracy |
|:---|:---:|:---:|:---:|:---:|
| **Gemini 3.1 Pro** | 326 | 2 | 10 | **99.39%** |
| **Qwen3.5-397B-A17B + Thinking** | 319 | 10 | 9 | **96.96%** |
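A scoring loop consistent with the description above might look like the following. This is a sketch under stated assumptions, not the authors' evaluation code: it assumes the final choice letter appears on the last non-empty line of the model output, and that cancelled problems carry the gold label `"IPTAL"`. All function names are hypothetical.

```python
import re

# Assumption: cancelled problems are marked with this gold label.
CANCELLED = "IPTAL"


def extract_final_letter(model_output):
    """Return the last standalone A-E letter on the final non-empty line, or None."""
    lines = [line.strip() for line in model_output.splitlines() if line.strip()]
    if not lines:
        return None
    matches = re.findall(r"\b([A-E])\b", lines[-1])
    return matches[-1] if matches else None


def score(outputs, gold_labels):
    """Count (passed, failed, cancelled) over paired outputs and gold labels."""
    passed = failed = cancelled = 0
    for output, gold in zip(outputs, gold_labels):
        if gold == CANCELLED:
            cancelled += 1  # ignored in the accuracy calculation
        elif extract_final_letter(output) == gold:
            passed += 1
        else:
            failed += 1
    return passed, failed, cancelled


def accuracy_percent(passed, failed):
    """Accuracy over non-cancelled problems, as a percentage with two decimals."""
    total = passed + failed
    return round(100.0 * passed / total, 2) if total else 0.0
```

With the counts from the table, `accuracy_percent(326, 2)` gives 99.39 and `accuracy_percent(319, 10)` gives 96.96, matching the reported figures.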
### Branch-Specific Overview
| Branch | Qwen Pass | Qwen Fail | Qwen Acc. | Gemini Pass | Gemini Fail | Gemini Acc. |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Computer | 148 | 6 | 96.10% | 152 | 1 | 99.35% |
| Mathematics | 123 | 4 | 96.85% | 126 | 1 | 99.21% |
| Physics | 48 | 0 | 100.00% | 48 | 0 | 100.00% |
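The branch rows can be cross-checked against the overall totals. The short sketch below copies the pass/fail counts from the table above, sums them per model, and recomputes each percentage as pass / (pass + fail); the dictionary layout and function names are illustrative choices.

```python
# (passed, failed) per branch and model, copied from the table above.
BRANCH_RESULTS = {
    "Computer": {"qwen": (148, 6), "gemini": (152, 1)},
    "Mathematics": {"qwen": (123, 4), "gemini": (126, 1)},
    "Physics": {"qwen": (48, 0), "gemini": (48, 0)},
}


def totals(model):
    """Sum passed and failed counts across branches for one model."""
    passed = sum(row[model][0] for row in BRANCH_RESULTS.values())
    failed = sum(row[model][1] for row in BRANCH_RESULTS.values())
    return passed, failed


def pct(passed, failed):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * passed / (passed + failed), 2)
```

Here `totals("qwen")` is `(319, 10)` and `totals("gemini")` is `(326, 2)`, which agree with the overall table.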
## Source & License
The original problems are sourced from the national science olympiads organized by TUBITAK (The Scientific and Technological Research Council of Turkey).
This formalized dataset is provided for research and educational purposes under the **CC BY 4.0** license. Necessary permissions have been acquired from TUBITAK by the research team for publishing this derived benchmark.
## Contact
COSMOS AI Research Group
Yildiz Technical University Computer Engineering Department
https://cosmos.yildiz.edu.tr/
cosmos@yildiz.edu.tr
Provider:
sghosts



