ScienceOlympiad
Source: ModelScope community. Updated: 2026-01-06. Indexed: 2025-06-21.
Download link:
https://modelscope.cn/datasets/ByteDance-Seed/ScienceOlympiad
Resource description:
# ScienceOlympiad: Challenging AI with Olympiad-Level Multimodal Science Problems
## Table of Contents
- [Dataset Description](#dataset-description)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [How to Use](#how-to-use)
- [Citation](#citation)
- [License](#license)
## Dataset Description
The **ScienceOlympiad** dataset is a meticulously curated benchmark designed to test the limits of current AI models in scientific reasoning. It comprises elite, competition-level problems in physics and chemistry. Addressing the need for more diverse and realistic challenges, **ScienceOlympiad** introduces multimodal integration as a key dimension. Unlike purely text-based datasets, a significant portion of problems requires models to analyze and interpret visual information from diagrams and figures—a critical skill for real-world scientific problem-solving. Our objective is for ScienceOlympiad to serve as a pivotal evaluation tool, accelerating progress in domain-specific reasoning for the physical and chemical sciences while advancing the frontier of integrated text and visual understanding.
## Data Fields
Each entry in the dataset represents a single problem and contains the following fields:
- `problem_id`: (`string`) - A unique identifier for each problem.
- `discipline`: (`ClassLabel`) - The scientific discipline of the problem (`Physics` or `Chemistry`).
- `problem`: (A dictionary of `string`) - A dictionary containing the problem statement in both English (`en`) and Chinese (`zh`).
- `answer`: (`string`) - The correct answer or solution.
- `image`: (`Image`) - An image or diagram associated with the problem, stored in the `images/` directory. This field will be `None` if no image is present.
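As a rough illustration of the schema above, the sketch below models one record as a plain Python dictionary and adds a few accessors. It assumes that `discipline` arrives as an integer class id (which is how the `datasets` library stores `ClassLabel` values) and that the label order is `Physics`, `Chemistry`, as in the loading snippet under "How to Use". The type names, helper names, and the sample record are hypothetical and are not part of the dataset.

```python
from typing import Any, Optional, TypedDict


class ProblemText(TypedDict):
    en: str
    zh: str


class Record(TypedDict):
    problem_id: str
    discipline: int          # ClassLabel id: assumed 0 = Physics, 1 = Chemistry
    problem: ProblemText
    answer: str
    image: Optional[Any]     # a decoded image object, or None


# Assumed label order, taken from the loading snippet in this README.
DISCIPLINE_NAMES = ("Physics", "Chemistry")


def discipline_name(record: Record) -> str:
    """Map the integer class id back to its label."""
    return DISCIPLINE_NAMES[record["discipline"]]


def problem_text(record: Record, lang: str = "en") -> str:
    """Return the problem statement in the requested language."""
    if lang not in ("en", "zh"):
        raise ValueError(f"unsupported language: {lang!r}")
    return record["problem"][lang]


def has_image(record: Record) -> bool:
    """True when the problem ships with a figure or diagram."""
    return record["image"] is not None


# Hypothetical record, invented purely to exercise the helpers.
sample: Record = {
    "problem_id": "demo-0001",
    "discipline": 1,
    "problem": {"en": "Placeholder problem statement.", "zh": "占位题目。"},
    "answer": "placeholder",
    "image": None,
}
```

Checking `image` against `None` explicitly, rather than relying on truthiness, keeps the text-only case unambiguous.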
## Data Splits
This dataset consists of a single **`test`** split, provided in the `test.parquet` file.
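If the repository files have already been downloaded, the split can probably be read straight from the Parquet file. The sketch below assumes `test.parquet` sits at the top of a local directory (`root` is a hypothetical path) and uses the generic `parquet` builder of the `datasets` library. Both function names are illustrative.

```python
from pathlib import Path
from typing import Dict


def parquet_data_files(root: str) -> Dict[str, str]:
    """Map the single `test` split to its Parquet file under `root`."""
    return {"test": str(Path(root) / "test.parquet")}


def load_local_test_split(root: str):
    """Load the `test` split from a local copy of the repository."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "parquet",
        data_files=parquet_data_files(root),
        split="test",
    )
```

Passing `split="test"` returns the split itself instead of a dictionary keyed by split name.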
## Dataset Creation
The dataset is composed of a curated collection of problems from three sources: classic problems, adaptations of established problems, and original compositions. Each problem has been meticulously reviewed to ensure rigorous difficulty and unambiguous wording.
## How to Use
You can load the dataset using the Hugging Face `datasets` library:
```python
import datasets
import matplotlib.pyplot as plt
from datasets import load_dataset

# Declare the schema explicitly so that `discipline` is read as a ClassLabel
# and `image` is decoded into a PIL image.
features = datasets.Features({
    'problem_id': datasets.Value('string'),
    'discipline': datasets.ClassLabel(names=['Physics', 'Chemistry']),
    'problem': {
        'en': datasets.Value('string'),
        'zh': datasets.Value('string'),
    },
    'answer': datasets.Value('string'),
    'image': datasets.Image(decode=True),
})

ds = load_dataset("ByteDance-Seed/ScienceOlympiad", features=features)
test_ds = ds['test']

# Display the image attached to one example, if it has one.
example = test_ds[60]
image_data = example['image']
if image_data is not None:
    plt.imshow(image_data)
    plt.show()
```
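Because only part of the benchmark is multimodal, it can help to tally problems by discipline and by whether they carry an image before running an evaluation. The following is a minimal sketch that works on any iterable of records shaped like the fields described above. It assumes integer `discipline` ids in the order `Physics`, `Chemistry`. The `summarize` helper and the toy records are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Tuple

# Assumed label order, matching the ClassLabel declared when loading.
DISCIPLINE_NAMES = ("Physics", "Chemistry")


def summarize(records: Iterable[Mapping[str, Any]]) -> "Counter[Tuple[str, str]]":
    """Count records per (discipline, modality) pair."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for record in records:
        name = DISCIPLINE_NAMES[record["discipline"]]
        modality = "multimodal" if record["image"] is not None else "text-only"
        counts[(name, modality)] += 1
    return counts


# Toy records invented for illustration; they are not taken from the dataset.
toy_records = [
    {"discipline": 0, "image": None},
    {"discipline": 0, "image": object()},  # stands in for a decoded image
    {"discipline": 1, "image": None},
]
toy_counts = summarize(toy_records)
```

The same function should accept the loaded split directly, for example `summarize(test_ds)`, although iterating the split decodes every image and may be slow.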
## Citation
If you use the ScienceOlympiad dataset in your research or work, please consider citing it:
```bibtex
@misc{bytedance_seed_2025_scienceolympiad,
  author       = {{Bytedance-Seed}},
  title        = {ScienceOlympiad: Challenging AI with Olympiad-Level Multimodal Science Problems},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/datasets/Bytedance-Seed/ScienceOlympiad}},
}
```
## License
ScienceOlympiad is released under the **CC0 1.0 Universal (CC0 1.0) Public Domain Dedication**.

This means the work has been dedicated to the public domain by waiving all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. For more details, see the [LICENSE](LICENSE) file or the [full legal text of the CC0 license](https://creativecommons.org/publicdomain/zero/1.0/legalcode).
Provider:
maas
Created:
2025-06-18



