Download link:

https://modelscope.cn/datasets/Shanghai_AI_Laboratory/InteractScience


Resource description:

# InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

<p>
  <a href='https://arxiv.org/abs/2510.09724'> <img src='https://img.shields.io/badge/arXiv-2510.09724-b31b1b.svg'> </a>
  <a href="https://github.com/open-compass/InteractScience"> <img alt="GitHub" src="https://img.shields.io/badge/Github-InteractScience-000000?logo=github"> </a>
  <a href="https://opensource.org/license/apache-2-0"> <img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-4285f4.svg?logo=apache"> </a>
</p>

InteractScience is a benchmark specifically designed to evaluate the capability of large language models in generating interactive scientific demonstration code. This project provides a complete evaluation pipeline including model inference, automated testing, and multi-dimensional assessment.

![](figs/hook.svg)

## 📊 Dataset Description

### interactscience.jsonl

Main dataset file, each line contains a test sample:

- `id`: Unique identifier
- `question`: Detailed HTML implementation plan
- `lm_system_prompt`: Language model system prompt
- `vlm_system_prompt`: Vision-language model system prompt
- `image_path`: List of reference screenshot paths
- `snapshot_checklists`: Visual verification checklists

### Reference Screenshots

Located in `data/snapshots/` directory, naming format:

- `{task_id}_Snapshot-{number}.png`

## 🚀 Usage Tutorial

### 1. Environment Setup

First install Node.js and npm, then install the Playwright testing environment:

```bash
# Install project dependencies
npm install

# Install Playwright browsers
npx playwright install
```

### 2. Model Inference

Use the `run_generation.sh` script for model inference:

```bash
# Edit the model path and parameters in the script
vim run_generation.sh

# Run inference (requires model path configuration)
bash run_generation.sh
```

**Script Description:**

- Starts vLLM API server
- Calls `test_llm.py` for inference
- Results saved to `eval/` directory

### 3. Automated Testing

Use the `run_benchmark.sh` script for automated testing:

```bash
# Set the model name to test
export MODEL="your_model_name"

# Run tests
bash run_benchmark.sh
```

**Testing Process:**

1. Extract HTML code from inference results (`extract_and_save_code.py`)
2. Execute Program Functionality Testing (PFT) using `playwright_PFT.config.js`
3. Execute Visual Quality Testing (VQT) using `playwright_VQT.config.js`
4. Calculate CLIP similarity scores (`clip_score.py`)
5. Results saved to `results/` directory

### 4. VLM Scoring

Use `run_vlm_as_judge.sh` for VLM-as-Judge evaluation:

```bash
# Edit model and path configuration in the script
vim run_vlm_as_judge.sh

# Run VLM scoring
bash run_vlm_as_judge.sh
```

**Scoring Description:**

- Uses vision-language models to score generated results
- Compares reference screenshots with generated screenshots
- Evaluation based on predefined checklists

### 5. Results Analysis

Use `cal_metrics.py` and `cal_vlm_as_judege_score.py` to calculate final metrics:

```bash
python cal_metrics.py
python cal_vlm_as_judege_score.py
```

## 🧪 Test Types

### 1. Program Functionality Testing (PFT)

- Validates functional correctness of HTML code
- Checks interactive element behavior
- Tests JavaScript logic

### 2. Visual Quality Testing (VQT)

- Generates page screenshots
- Compares with reference screenshots
- Calculates perceptual similarity (CLIP scores)
- Calculates semantic correctness (VLM-judge scores)

## 🛠️ Core Scripts Description

### test_llm.py

Language model testing main program:

```bash
python test_llm.py \
    --dataset_path data/interactscience.jsonl \
    --prompt_type lm_system_prompt \
    --dump_path eval/result.jsonl \
    --model_path your_model_path \
    --base_url http://localhost:8000/v1 \
    --api_key EMPTY
```

### vlm_as_judge.py

VLM scoring main program:

```bash
python vlm_as_judge.py \
    --reference_image_dir data/snapshots \
    --generated_image_dir generated_images \
    --checklist_file data/checklists.jsonl \
    --output_path results/vlm_judge.jsonl \
    --base_url your_api_endpoint \
    --api_key your_api_key
```

## 📈 Evaluation Metrics

- **Program Functionality Test Pass Rate**: Percentage of PFT test cases passed
- **Visual Quality Score**: Visual similarity based on CLIP model
- **VLM Score**: Comprehensive score given by multimodal models

## Experiments

We have evaluated 30 state-of-the-art large language models on the InteractScience benchmark. The results are available in the `results/` directory.
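The results table in this section reports three PFT columns (Overall, Average, Perfect) that this README does not define. As a rough sketch only, assuming Overall is the pass rate pooled over all test cases, Average is the mean of the per-sample pass rates, and Perfect is the share of samples whose test cases all pass, the three numbers could be computed as below. The function name `pft_metrics` and the input layout (one list of booleans per sample) are hypothetical and are not taken from `cal_metrics.py`.

```python
from typing import Dict, List


def pft_metrics(results: List[List[bool]]) -> Dict[str, float]:
    """Compute three PFT-style pass rates, in percent.

    `results` holds one inner list per sample; each entry records whether a
    single test case passed. This layout is an assumption for illustration.
    """
    samples = [r for r in results if r]  # skip samples without test cases
    if not samples:
        return {"overall": 0.0, "average": 0.0, "perfect": 0.0}

    total_cases = sum(len(r) for r in samples)
    total_passed = sum(sum(r) for r in samples)

    # Assumed definition: pass rate pooled over every test case.
    overall = 100.0 * total_passed / total_cases
    # Assumed definition: mean of the per-sample pass rates.
    average = 100.0 * sum(sum(r) / len(r) for r in samples) / len(samples)
    # Assumed definition: share of samples with all test cases passing.
    perfect = 100.0 * sum(all(r) for r in samples) / len(samples)
    return {"overall": overall, "average": average, "perfect": perfect}


if __name__ == "__main__":
    # Made-up outcomes for three samples, not real benchmark data.
    demo = [[True, True], [True, False, False, False], [False, False]]
    print(pft_metrics(demo))
```

Under these assumed definitions, Overall and Average differ whenever samples have different numbers of test cases, which would explain why the two columns are reported separately.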
| **Model** | **PFT Overall (%)** | **PFT Average (%)** | **PFT Perfect (%)** | **VQT Action (%)** | **VQT CLIP** | **VQT VLM-judge** |
|----------------------------|---------------------|---------------------|---------------------|--------------------|--------------|-------------------|
| **Closed-Source Large Language Models** |||||||
| GPT-5 | 39.47 | **37.61** | **16.08** | 89.66 | 71.95 | **57.02** |
| GPT-4.1 | 37.07 | 34.08 | 11.19 | 89.15 | 71.21 | 52.84 |
| GPT-4o | 28.27 | 27.09 | 5.59 | 85.93 | 67.11 | 42.45 |
| o3 | 34.93 | 32.09 | 13.99 | 89.83 | 72.24 | 52.82 |
| o4-mini | 37.33 | 34.90 | 13.29 | 88.64 | 71.79 | 51.90 |
| Gemini-2.5-Pro | 35.33 | 34.62 | 11.19 | 86.78 | 70.65 | 54.69 |
| Gemini-2.5-Flash | 31.60 | 31.07 | 10.49 | 86.95 | 69.59 | 49.34 |
| Claude-Sonnet-4-20250514 | **41.47** | 37.40 | 13.29 | 89.66 | 73.50 | 55.42 |
| Claude-Opus-4-20250514 | 40.27 | 36.34 | 11.19 | 89.32 | **73.22** | 54.93 |
| Claude-3.5-Sonnet | 33.33 | 31.45 | 9.79 | **90.17** | 72.32 | 49.43 |
| **Open-Source Large Language Models** |||||||
| DeepSeek-R1-0528 | **33.87** | **32.02** | 8.39 | 88.31 | 69.54 | 49.46 |
| DeepSeek-V3-0324 | 31.73 | 30.57 | 10.49 | 85.93 | 68.68 | 49.46 |
| Kimi-K2 | 31.60 | 31.22 | 9.79 | 87.29 | 70.11 | **50.04** |
| GLM-4.5 | 29.33 | 26.65 | 8.39 | 70.51 | 55.90 | 38.57 |
| Intern-S1 | 31.87 | 28.93 | 7.69 | 87.46 | 68.74 | 45.27 |
| gpt-oss-120b | 28.00 | 27.78 | 9.79 | **90.85** | **72.13** | 49.57 |
| gpt-oss-20b | 15.20 | 12.97 | 3.50 | 80.51 | 54.68 | 21.40 |
| Qwen3-235B-A22B-Instruct-2507 | 33.33 | 31.46 | **13.29** | 78.14 | 70.02 | 45.14 |
| Qwen3-32B | 27.20 | 24.09 | 5.59 | 87.46 | 66.46 | 39.69 |
| Qwen3-14B | 24.13 | 23.58 | 7.69 | 85.08 | 66.46 | 36.53 |
| Qwen3-8B | 20.00 | 18.85 | 4.20 | 81.53 | 64.13 | 34.67 |
| Qwen3-4B | 14.67 | 13.10 | 2.80 | 82.03 | 60.90 | 28.33 |
| Qwen3-1.7B | 6.53 | 6.22 | 1.40 | 75.76 | 59.65 | 20.33 |
| Qwen2.5-Coder-32B-Instruct | 27.20 | 25.10 | 7.69 | 84.58 | 51.67 | 38.51 |
| Qwen2.5-Coder-14B-Instruct | 22.53 | 20.61 | 4.90 | 85.42 | 64.47 | 35.72 |
| Qwen2.5-Coder-7B-Instruct | 12.40 | 10.51 | 0.70 | 82.37 | 65.17 | 26.97 |
| Qwen2.5-VL-72B-Instruct | 23.73 | 22.82 | 6.99 | 87.12 | 64.33 | 37.30 |
| Qwen2.5-VL-7B-Instruct | 7.47 | 6.72 | 0.70 | 70.00 | 49.49 | 20.41 |
| Llama-3.1-70B-Instruct | 18.67 | 18.04 | 4.90 | 88.64 | 59.56 | 33.36 |
| Llama-3.1-8B-Instruct | 11.33 | 10.16 | 3.50 | 80.00 | 65.42 | 22.75 |

### Comparison Across Difficulty Levels

![](figs/model_performance_comparison_difficulty.svg)

### Comparison Across Disciplines

![](figs/model_performance_comparison_discipline.svg)

### Results on Multimodal LLMs with Reference Snapshots as Input

![](figs/model_performance_vs_images.svg)

## Example Cases

![](figs/example1.svg)

![](figs/example2.svg)

## Citation

```
@article{InteractScience,
  author  = {Qiaosheng Chen and Yang Liu and Lei Li and Kai Chen and Qipeng Guo and Gong Cheng and Fei Yuan},
  title   = {InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation},
  journal = {arXiv preprint arXiv:2510.09724},
  year    = {2025}
}
```
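As a small illustration of the dataset layout described under Dataset Description, the sketch below parses JSONL text, checks that each record carries the six documented fields, and splits a snapshot filename of the form `{task_id}_Snapshot-{number}.png`. The helper names, the regular expression, and the sample record are made up for illustration; the real `id` values and the exact type of each field are not specified in this README.

```python
import json
import re
from typing import Dict, Iterator, Optional, Tuple

# Field names as listed under "Dataset Description".
EXPECTED_FIELDS = {
    "id", "question", "lm_system_prompt",
    "vlm_system_prompt", "image_path", "snapshot_checklists",
}

# Assumed pattern for `{task_id}_Snapshot-{number}.png`.
SNAPSHOT_RE = re.compile(r"^(?P<task_id>.+)_Snapshot-(?P<number>\d+)\.png$")


def iter_samples(jsonl_text: str) -> Iterator[Dict]:
    """Yield one dict per non-empty JSONL line, validating field names."""
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        yield record


def parse_snapshot_name(filename: str) -> Optional[Tuple[str, int]]:
    """Split a snapshot filename into (task_id, number), or return None."""
    match = SNAPSHOT_RE.match(filename)
    if match is None:
        return None
    return match.group("task_id"), int(match.group("number"))


if __name__ == "__main__":
    # Hypothetical record; values are placeholders, not real dataset content.
    sample = {
        "id": "DemoTask",
        "question": "...",
        "lm_system_prompt": "...",
        "vlm_system_prompt": "...",
        "image_path": ["data/snapshots/DemoTask_Snapshot-1.png"],
        "snapshot_checklists": [],
    }
    for record in iter_samples(json.dumps(sample) + "\n"):
        print(record["id"], parse_snapshot_name("DemoTask_Snapshot-1.png"))
```

To read the real file, the text of `data/interactscience.jsonl` (the path used in the `test_llm.py` example) can be passed to `iter_samples`.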


Application scenarios: