SeedBench
Source: ModelScope community. Updated 2025-11-24; indexed 2025-04-19.
Download link:
https://modelscope.cn/datasets/y12869741/SeedBench
Description:
# SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science
SeedBench is the first multi-task benchmark designed to evaluate large language models (LLMs) in seed science, focusing on seed breeding. This repository includes the dataset, evaluation code, and documentation to support research in this domain.
[GitHub page](https://github.com/open-sciencelab/SeedBench)
## Overview
SeedBench assesses LLMs across three core seed breeding stages:
- Gene Information Retrieval
- Gene Function and Regulation Analysis
- Variety Breeding with Agronomic Trait Optimization
Built with domain experts, SeedBench features 2,264 expert-validated questions across 11 task types and 10 subcategories, initially targeting rice breeding. Future updates will include other crops like maize, soybean, and wheat.
## Dataset Details
- Corpus: 308,727 publications cleaned to 1.1 billion tokens; 279 segments from 113 documents.
- Questions: 2,264 across 11 task types, bilingual (English/Chinese), expert-validated.
- Focus: Rice breeding as a representative case.
### Performance by Task Types
| Model | QA-1 | QA-2 | QA-3 | QA-4 | SUM-1 | SUM-2 | RC-1 | RC-2 | RC-3 | RC-4 | RC-5 | Avg |
|------------------|------|------|------|------|-------|-------|------|------|------|------|------|------|
| GPT-4 | 60.50| 73.87| 21.35| 36.07| 58.73 | 62.89 | 100.00| 96.44| 87.86| 62.29| 86.74| 67.88|
| DeepSeek-V3 | 72.50| 79.84| 29.29| 40.63| 48.06 | 54.67 | 100.00| 97.22| 87.89| 55.19| 86.74| 68.37|
| Qwen2-72B | 59.50| 75.98| 19.55| 31.62| 31.08 | 63.09 | 99.12 | 94.24| 72.20| 51.58| 89.96| 62.54|
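
As a quick sanity check on the table above, the following sketch recomputes each model's average as the unweighted mean of its 11 task-type scores and compares it with the reported Avg column. The scores are copied from the table. The names `TASK_SCORES` and `recomputed_averages` are illustrative, and treating Avg as an unweighted mean is an assumption, not something taken from the SeedBench evaluation code.

```python
from statistics import mean

# Per-task scores (QA-1 ... RC-5) and the reported Avg, copied from the table above.
TASK_SCORES = {
    "GPT-4": ([60.50, 73.87, 21.35, 36.07, 58.73, 62.89,
               100.00, 96.44, 87.86, 62.29, 86.74], 67.88),
    "DeepSeek-V3": ([72.50, 79.84, 29.29, 40.63, 48.06, 54.67,
                     100.00, 97.22, 87.89, 55.19, 86.74], 68.37),
    "Qwen2-72B": ([59.50, 75.98, 19.55, 31.62, 31.08, 63.09,
                   99.12, 94.24, 72.20, 51.58, 89.96], 62.54),
}


def recomputed_averages(table):
    """Return {model: (unweighted mean of the per-task scores, reported Avg)}."""
    return {model: (mean(scores), reported)
            for model, (scores, reported) in table.items()}


for model, (recomputed, reported) in recomputed_averages(TASK_SCORES).items():
    # Assumption under test: Avg is the plain mean of the 11 task-type columns.
    print(f"{model:12s} recomputed={recomputed:.3f} reported={reported:.2f}")
```

For these three rows the recomputed means agree with the reported Avg values to within 0.01, which is consistent with an unweighted mean taken over scores that were rounded for display.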
### Performance by Subcategory
| Model | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | Avg |
|-------------------|------|------|------|------|------|------|------|------|------|------|------|
| GPT-4 | 59.59| 60.55| 76.32| 61.16| 56.34| 59.35| 63.67| 64.74| 60.65| 67.66| 62.06|
| DeepSeek-V3-671B | 56.03| 62.42| 74.81| 63.17| 55.23| 58.84| 68.23| 69.04| 66.46| 68.48| 63.30|
| Qwen2-72B | 51.16| 58.10| 74.07| 59.72| 51.58| 57.76| 58.85| 61.63| 56.69| 59.11| 57.62|
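
The same check can be applied to the subcategory table. The sketch below measures how far the unweighted mean of C1 to C10 sits from the reported Avg. The scores are copied from the table; `SUBCATEGORY_SCORES` and `gap_to_reported` are illustrative names, and the source does not state how this Avg column was aggregated.

```python
from statistics import mean

# Per-subcategory scores (C1 ... C10) and the reported Avg, copied from the table above.
SUBCATEGORY_SCORES = {
    "GPT-4": ([59.59, 60.55, 76.32, 61.16, 56.34,
               59.35, 63.67, 64.74, 60.65, 67.66], 62.06),
    "DeepSeek-V3-671B": ([56.03, 62.42, 74.81, 63.17, 55.23,
                          58.84, 68.23, 69.04, 66.46, 68.48], 63.30),
    "Qwen2-72B": ([51.16, 58.10, 74.07, 59.72, 51.58,
                   57.76, 58.85, 61.63, 56.69, 59.11], 57.62),
}


def gap_to_reported(table):
    """Return {model: unweighted mean of C1..C10 minus the reported Avg}."""
    return {model: mean(scores) - reported
            for model, (scores, reported) in table.items()}


for model, gap in gap_to_reported(SUBCATEGORY_SCORES).items():
    print(f"{model:18s} unweighted mean - reported Avg = {gap:+.2f}")
```

Here the unweighted mean is roughly one point higher than the reported Avg in every row, so this column is probably not a plain mean of the ten subcategory scores. Weighting by the number of questions per subcategory is one possible explanation, but that is a guess; the paper is the place to confirm the aggregation.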
## Key Results
We evaluated 26 LLMs, including proprietary, open-source, and domain-specific models.
- Top Performers by Question Type: DeepSeek-V3 (68.37), GPT-4 (67.88).
- Top Performers by Subcategory: DeepSeek-V3-671B (63.30), GPT-4 (62.06).
## Citation
For more comprehensive information, please refer to the [paper](https://arxiv.org/abs/2505.13220).
```bibtex
@inproceedings{ying2025seedbench,
title={SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science},
author={Ying, Jie and Chen, Zihong and Wang, Zhefan and Jiang, Wanli and Wang, Chenyang and Yuan, Zhonghang and Su, Haoyang and Kong, Huanjun and Yang, Fan and Dong, Nanqing},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year={2025}
}
```
Provider:
maas
Created:
2025-04-14



