InteractScience
收藏魔搭社区2025-12-05 更新2025-11-03 收录
下载链接:
https://modelscope.cn/datasets/Shanghai_AI_Laboratory/InteractScience
下载链接
链接失效反馈官方服务:
资源简介:
# InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
<p>
<a href='https://arxiv.org/abs/2510.09724'>
<img src='https://img.shields.io/badge/arXiv-2510.09724-b31b1b.svg'>
</a>
<a href="https://github.com/open-compass/InteractScience">
<img alt="GitHub" src="https://img.shields.io/badge/Github-InteractScience-000000?logo=github">
</a>
<a href="https://opensource.org/license/apache-2-0">
<img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-4285f4.svg?logo=apache">
</a>
</p>
InteractScience is a benchmark specifically designed to evaluate the capability of large language models in generating interactive scientific demonstration code. This project provides a complete evaluation pipeline including model inference, automated testing, and multi-dimensional assessment.

## 📊 Dataset Description
### interactscience.jsonl
Main dataset file, each line contains a test sample:
- `id`: Unique identifier
- `question`: Detailed HTML implementation plan
- `lm_system_prompt`: Language model system prompt
- `vlm_system_prompt`: Vision-language model system prompt
- `image_path`: List of reference screenshot paths
- `snapshot_checklists`: Visual verification checklists
### Reference Screenshots
Located in `data/snapshots/` directory, naming format:
- `{task_id}_Snapshot-{number}.png`
## 🚀 Usage Tutorial
### 1. Environment Setup
First install Node.js and npm, then install the Playwright testing environment:
```bash
# Install project dependencies
npm install
# Install Playwright browsers
npx playwright install
```
### 2. Model Inference
Use the `run_generation.sh` script for model inference:
```bash
# Edit the model path and parameters in the script
vim run_generation.sh
# Run inference (requires model path configuration)
bash run_generation.sh
```
**Script Description:**
- Starts vLLM API server
- Calls `test_llm.py` for inference
- Results saved to `eval/` directory
### 3. Automated Testing
Use the `run_benchmark.sh` script for automated testing:
```bash
# Set the model name to test
export MODEL="your_model_name"
# Run tests
bash run_benchmark.sh
```
**Testing Process:**
1. Extract HTML code from inference results (`extract_and_save_code.py`)
2. Execute Program Functionality Testing (PFT) using `playwright_PFT.config.js`
3. Execute Visual Quality Testing (VQT) using `playwright_VQT.config.js`
4. Calculate CLIP similarity scores (`clip_score.py`)
5. Results saved to `results/` directory
### 4. VLM Scoring
Use `run_vlm_as_judge.sh` for VLM-as-Judge evaluation:
```bash
# Edit model and path configuration in the script
vim run_vlm_as_judge.sh
# Run VLM scoring
bash run_vlm_as_judge.sh
```
**Scoring Description:**
- Uses vision-language models to score generated results
- Compares reference screenshots with generated screenshots
- Evaluation based on predefined checklists
### 5. Results Analysis
Use `cal_metrics.py` and `cal_vlm_as_judege_score.py` to calculate final metrics:
```bash
python cal_metrics.py
python cal_vlm_as_judege_score.py
```
## 🧪 Test Types
### 1. Program Functionality Testing (PFT)
- Validates functional correctness of HTML code
- Checks interactive element behavior
- Tests JavaScript logic
### 2. Visual Quality Testing (VQT)
- Generates page screenshots
- Compares with reference screenshots
- Calculates perceptual similarity (CLIP scores)
- Calculates semantic correctness (VLM-judge scores)
## 🛠️ Core Scripts Description
### test_llm.py
Language model testing main program:
```bash
python test_llm.py \
--dataset_path data/interactscience.jsonl \
--prompt_type lm_system_prompt \
--dump_path eval/result.jsonl \
--model_path your_model_path \
--base_url http://localhost:8000/v1 \
--api_key EMPTY
```
### vlm_as_judge.py
VLM scoring main program:
```bash
python vlm_as_judge.py \
--reference_image_dir data/snapshots \
--generated_image_dir generated_images \
--checklist_file data/checklists.jsonl \
--output_path results/vlm_judge.jsonl \
--base_url your_api_endpoint \
--api_key your_api_key
```
## 📈 Evaluation Metrics
- **Program Functionality Test Pass Rate**: Percentage of PFT test cases passed
- **Visual Quality Score**: Visual similarity based on CLIP model
- **VLM Score**: Comprehensive score given by multimodal models
## Experiments
We have evaluated 30 state-of-the-art large language models on the InteractScience benchmark. The results are available in the `results/` directory.
| **Model** | **PFT Overall (%)** | **PFT Average (%)** | **PFT Perfect (%)** | **VQT Action (%)** | **VQT CLIP** | **VQT VLM-judge** |
|----------------------------|---------------------|---------------------|---------------------|--------------------|--------------|-------------------|
| **Closed-Source Large Language Models** |||||||
| GPT-5 | 39.47 | **37.61** | **16.08** | 89.66 | 71.95 | **57.02** |
| GPT-4.1 | 37.07 | 34.08 | 11.19 | 89.15 | 71.21 | 52.84 |
| GPT-4o | 28.27 | 27.09 | 5.59 | 85.93 | 67.11 | 42.45 |
| o3 | 34.93 | 32.09 | 13.99 | 89.83 | 72.24 | 52.82 |
| o4-mini | 37.33 | 34.90 | 13.29 | 88.64 | 71.79 | 51.90 |
| Gemini-2.5-Pro | 35.33 | 34.62 | 11.19 | 86.78 | 70.65 | 54.69 |
| Gemini-2.5-Flash | 31.60 | 31.07 | 10.49 | 86.95 | 69.59 | 49.34 |
| Claude-Sonnet-4-20250514 | **41.47** | 37.40 | 13.29 | 89.66 | 73.50 | 55.42 |
| Claude-Opus-4-20250514 | 40.27 | 36.34 | 11.19 | 89.32 | **73.22** | 54.93 |
| Claude-3.5-Sonnet | 33.33 | 31.45 | 9.79 | **90.17** | 72.32 | 49.43 |
| **Open-Source Large Language Models** |||||||
| DeepSeek-R1-0528 | **33.87** | **32.02** | 8.39 | 88.31 | 69.54 | 49.46 |
| DeepSeek-V3-0324 | 31.73 | 30.57 | 10.49 | 85.93 | 68.68 | 49.46 |
| Kimi-K2 | 31.60 | 31.22 | 9.79 | 87.29 | 70.11 | **50.04** |
| GLM-4.5 | 29.33 | 26.65 | 8.39 | 70.51 | 55.90 | 38.57 |
| Intern-S1 | 31.87 | 28.93 | 7.69 | 87.46 | 68.74 | 45.27 |
| gpt-oss-120b | 28.00 | 27.78 | 9.79 | **90.85** | **72.13** | 49.57 |
| gpt-oss-20b | 15.20 | 12.97 | 3.50 | 80.51 | 54.68 | 21.40 |
| Qwen3-235B-A22B-Instruct-2507 | 33.33 | 31.46 | **13.29** | 78.14 | 70.02 | 45.14 |
| Qwen3-32B | 27.20 | 24.09 | 5.59 | 87.46 | 66.46 | 39.69 |
| Qwen3-14B | 24.13 | 23.58 | 7.69 | 85.08 | 66.46 | 36.53 |
| Qwen3-8B | 20.00 | 18.85 | 4.20 | 81.53 | 64.13 | 34.67 |
| Qwen3-4B | 14.67 | 13.10 | 2.80 | 82.03 | 60.90 | 28.33 |
| Qwen3-1.7B | 6.53 | 6.22 | 1.40 | 75.76 | 59.65 | 20.33 |
| Qwen2.5-Coder-32B-Instruct | 27.20 | 25.10 | 7.69 | 84.58 | 51.67 | 38.51 |
| Qwen2.5-Coder-14B-Instruct | 22.53 | 20.61 | 4.90 | 85.42 | 64.47 | 35.72 |
| Qwen2.5-Coder-7B-Instruct | 12.40 | 10.51 | 0.70 | 82.37 | 65.17 | 26.97 |
| Qwen2.5-VL-72B-Instruct | 23.73 | 22.82 | 6.99 | 87.12 | 64.33 | 37.30 |
| Qwen2.5-VL-7B-Instruct | 7.47 | 6.72 | 0.70 | 70.00 | 49.49 | 20.41 |
| Llama-3.1-70B-Instruct | 18.67 | 18.04 | 4.90 | 88.64 | 59.56 | 33.36 |
| Llama-3.1-8B-Instruct | 11.33 | 10.16 | 3.50 | 80.00 | 65.42 | 22.75 |
### Comparison Across Difficulty Levels

### Comparison Across Disciplines

### Results on Multimodal LLMs with Reference Snapshots as Input

## Example Cases


## Citation
```
@article{InteractScience,
author = {Qiaosheng Chen and Yang Liu and Lei Li and Kai Chen and Qipeng Guo and Gong Cheng and Fei Yuan},
title = {InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation},
journal = {arXiv preprint arXiv:2510.09724},
year = {2025}
}
```
# InteractScience: 交互式科学演示代码生成的程序化与视觉锚定评估
<p>
<a href='https://arxiv.org/abs/2510.09724'>
<img src='https://img.shields.io/badge/arXiv-2510.09724-b31b1b.svg'>
</a>
<a href="https://github.com/open-compass/InteractScience">
<img alt="GitHub" src="https://img.shields.io/badge/Github-InteractScience-000000?logo=github">
</a>
<a href="https://opensource.org/license/apache-2.0">
<img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-4285f4.svg?logo=apache">
</a>
</p>
InteractScience是专为评估大语言模型(Large Language Model, LLM)生成交互式科学演示代码的能力而设计的基准数据集。本项目提供了一套完整的评估流水线,涵盖模型推理、自动化测试与多维度评估。

## 📊 数据集说明
### interactscience.jsonl
主数据集文件,每行对应一个测试样本:
- `id`: 唯一标识符
- `question`: 详细的HTML实现方案
- `lm_system_prompt`: 大语言模型系统提示词
- `vlm_system_prompt`: 视觉语言模型(Vision-Language Model, VLM)系统提示词
- `image_path`: 参考截图路径列表
- `snapshot_checklists`: 视觉验证清单
### 参考截图
位于`data/snapshots/`目录,命名格式为:
- `{task_id}_Snapshot-{number}.png`
## 🚀 使用教程
### 1. 环境配置
首先安装Node.js与npm,随后安装Playwright测试环境:
bash
# 安装项目依赖
npm install
# 安装Playwright浏览器环境
npx playwright install
### 2. 模型推理
使用`run_generation.sh`脚本执行模型推理:
bash
# 编辑脚本中的模型路径与参数
vim run_generation.sh
# 运行推理(需配置模型路径)
bash run_generation.sh
**脚本说明**:
- 启动vLLM API服务器
- 调用`test_llm.py`执行推理
- 结果保存至`eval/`目录
### 3. 自动化测试
使用`run_benchmark.sh`脚本执行自动化测试:
bash
# 设置待测试的模型名称
export MODEL="your_model_name"
# 运行测试
bash run_benchmark.sh
**测试流程**:
1. 从推理结果中提取HTML代码(`extract_and_save_code.py`)
2. 使用`playwright_PFT.config.js`执行程序功能测试(Program Functionality Testing, PFT)
3. 使用`playwright_VQT.config.js`执行视觉质量测试(Visual Quality Testing, VQT)
4. 计算CLIP相似度得分(`clip_score.py`)
5. 结果保存至`results/`目录
### 4. 视觉语言模型评分
使用`run_vlm_as_judge.sh`执行基于视觉语言模型的裁判式评估:
bash
# 编辑脚本中的模型与路径配置
vim run_vlm_as_judge.sh
# 运行视觉语言模型评分
bash run_vlm_as_judge.sh
**评分说明**:
- 使用视觉语言模型对生成结果进行评分
- 对比参考截图与生成截图
- 基于预定义清单完成评估
### 5. 结果分析
使用`cal_metrics.py`与`cal_vlm_as_judege_score.py`计算最终指标:
bash
python cal_metrics.py
python cal_vlm_as_judege_score.py
## 🧪 测试类型
### 1. 程序功能测试(Program Functionality Testing, PFT)
- 验证HTML代码的功能正确性
- 检查交互元素的行为表现
- 测试JavaScript逻辑
### 2. 视觉质量测试(Visual Quality Testing, VQT)
- 生成页面截图
- 与参考截图进行对比
- 计算感知相似度(CLIP得分)
- 计算语义正确性(视觉语言模型裁判得分)
## 🛠️ 核心脚本说明
### test_llm.py
大语言模型测试主程序:
bash
python test_llm.py
--dataset_path data/interactscience.jsonl
--prompt_type lm_system_prompt
--dump_path eval/result.jsonl
--model_path your_model_path
--base_url http://localhost:8000/v1
--api_key EMPTY
### vlm_as_judge.py
视觉语言模型评分主程序:
bash
python vlm_as_judge.py
--reference_image_dir data/snapshots
--generated_image_dir generated_images
--checklist_file data/checklists.jsonl
--output_path results/vlm_judge.jsonl
--base_url your_api_endpoint
--api_key your_api_key
## 📈 评估指标
- **程序功能测试通过率**:通过的PFT测试用例占比
- **视觉质量得分**:基于CLIP模型的视觉相似度得分
- **视觉语言模型得分**:多模态模型给出的综合评分
## 实验
我们在InteractScience基准数据集上评估了30个当前主流的大语言模型,相关结果可在`results/`目录中获取。
| **模型** | **PFT整体通过率(%)** | **PFT平均通过率(%)** | **PFT完美通过率(%)** | **VQT操作准确率(%)** | **VQT CLIP得分** | **VQT 视觉语言模型裁判得分** |
|----------------------------|---------------------|---------------------|---------------------|--------------------|--------------|-------------------|
| **闭源大语言模型** |||||||
| GPT-5 | 39.47 | **37.61** | **16.08** | 89.66 | 71.95 | **57.02** |
| GPT-4.1 | 37.07 | 34.08 | 11.19 | 89.15 | 71.21 | 52.84 |
| GPT-4o | 28.27 | 27.09 | 5.59 | 85.93 | 67.11 | 42.45 |
| o3 | 34.93 | 32.09 | 13.99 | 89.83 | 72.24 | 52.82 |
| o4-mini | 37.33 | 34.90 | 13.29 | 88.64 | 71.79 | 51.90 |
| Gemini-2.5-Pro | 35.33 | 34.62 | 11.19 | 86.78 | 70.65 | 54.69 |
| Gemini-2.5-Flash | 31.60 | 31.07 | 10.49 | 86.95 | 69.59 | 49.34 |
| Claude-Sonnet-4-20250514 | **41.47** | 37.40 | 13.29 | 89.66 | 73.50 | 55.42 |
| Claude-Opus-4-20250514 | 40.27 | 36.34 | 11.19 | 89.32 | **73.22** | 54.93 |
| Claude-3.5-Sonnet | 33.33 | 31.45 | 9.79 | **90.17** | 72.32 | 49.43 |
| **开源大语言模型** |||||||
| DeepSeek-R1-0528 | **33.87** | **32.02** | 8.39 | 88.31 | 69.54 | 49.46 |
| DeepSeek-V3-0324 | 31.73 | 30.57 | 10.49 | 85.93 | 68.68 | 49.46 |
| Kimi-K2 | 31.60 | 31.22 | 9.79 | 87.29 | 70.11 | **50.04** |
| GLM-4.5 | 29.33 | 26.65 | 8.39 | 70.51 | 55.90 | 38.57 |
| Intern-S1 | 31.87 | 28.93 | 7.69 | 87.46 | 68.74 | 45.27 |
| gpt-oss-120b | 28.00 | 27.78 | 9.79 | **90.85** | **72.13** | 49.57 |
| gpt-oss-20b | 15.20 | 12.97 | 3.50 | 80.51 | 54.68 | 21.40 |
| Qwen3-235B-A22B-Instruct-2507 | 33.33 | 31.46 | **13.29** | 78.14 | 70.02 | 45.14 |
| Qwen3-32B | 27.20 | 24.09 | 5.59 | 87.46 | 66.46 | 39.69 |
| Qwen3-14B | 24.13 | 23.58 | 7.69 | 85.08 | 66.46 | 36.53 |
| Qwen3-8B | 20.00 | 18.85 | 4.20 | 81.53 | 64.13 | 34.67 |
| Qwen3-4B | 14.67 | 13.10 | 2.80 | 82.03 | 60.90 | 28.33 |
| Qwen3-1.7B | 6.53 | 6.22 | 1.40 | 75.76 | 59.65 | 20.33 |
| Qwen2.5-Coder-32B-Instruct | 27.20 | 25.10 | 7.69 | 84.58 | 51.67 | 38.51 |
| Qwen2.5-Coder-14B-Instruct | 22.53 | 20.61 | 4.90 | 85.42 | 64.47 | 35.72 |
| Qwen2.5-Coder-7B-Instruct | 12.40 | 10.51 | 0.70 | 82.37 | 65.17 | 26.97 |
| Qwen2.5-VL-72B-Instruct | 23.73 | 22.82 | 6.99 | 87.12 | 64.33 | 37.30 |
| Qwen2.5-VL-7B-Instruct | 7.47 | 6.72 | 0.70 | 70.00 | 49.49 | 20.41 |
| Llama-3.1-70B-Instruct | 18.67 | 18.04 | 4.90 | 88.64 | 59.56 | 33.36 |
| Llama-3.1-8B-Instruct | 11.33 | 10.16 | 3.50 | 80.00 | 65.42 | 22.75 |
### 不同难度级别下的性能对比

### 不同学科领域下的性能对比

### 以参考快照作为输入的多模态大语言模型性能对比

## 示例案例


## 引用
@article{InteractScience,
author = {陈乔生 and 刘阳 and 李磊 and 陈凯 and 郭启鹏 and 程功 and 袁飞},
title = {InteractScience: 交互式科学演示代码生成的程序化与视觉锚定评估},
journal = {arXiv预印本 arXiv:2510.09724},
year = {2025}
}
提供机构:
maas
创建时间:
2025-10-31



