MaterialsQA-SFT
收藏魔搭社区2025-11-20 更新2025-07-05 收录
下载链接:
https://modelscope.cn/datasets/morethankk/MaterialsQA-SFT
下载链接
链接失效反馈官方服务:
资源简介:
# MaterialsQA-SFT: A Question-Answering Dataset for Materials Science LLM Evaluation
## Dataset Card for MaterialsQA-SFT
### Dataset Summary
MaterialsQA-SFT is a synthetically generated question-answering dataset designed to evaluate the generation accuracy of large language models (LLMs), particularly in materials science domains including:
- High-Entropy Alloys (HEAs)
- Thermal Barrier Coatings (TBCs)
- High-Temperature Oxidation
The dataset consists of 902 expert-verified question-answer pairs, distilled and reformulated from recent authoritative review articles. It aims to benchmark LLM performance in materials-specific knowledge extraction, reasoning, and summarization tasks.
### Source Literature
The question-answer pairs were curated from the following comprehensive review papers:
High-Entropy Alloys
[George et al., Acta Materialia, 2020](https://doi.org/10.1016/j.actamat.2019.12.015)
Thermal Barrier Coatings
[Mondal, Industrial & Engineering Chemistry Research, 2021](https://pubs.acs.org/doi/10.1021/acs.iecr.1c00788?ref=pdf)
High-Temperature Oxidation
[Gao et al., Progress in Materials Science, 2025](https://doi.org/10.1016/j.pmatsci.2024.101348)
Extracted questions and answers were paraphrased to enhance clarity while retaining technical accuracy, and subsequently verified by materials science experts.
### Dataset Structure
The dataset is provided in two commonly used formats to facilitate diverse LLM training and evaluation pipelines:
1. EvalScope Format
Each sample contains:
query: The materials science question
response: The expert-verified answer
3. Alpaca Format
Each sample contains:
messages: A list of two messages in conversational format
First message: role = "user", content = question
Second message: role = "assistant", content = answer
### Intended Uses
- Benchmarking LLM performance in materials science knowledge generation, extraction, summarization, and reasoning tasks.
- Fine-tuning or instruction tuning materials science specialized LLMs (ensure proper license compliance if used for training).
### Baseline Model Performance
The dataset was used to evaluate several recent models with the following qualitative observations(batch_size=8):
| Model | Rouge-1-R | Rouge-1-P | Rouge-1-F | Rouge-2-R | Rouge-2-P | Rouge-2-F | Rouge-L-R | Rouge-L-P | Rouge-L-F | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Deepseek-R1 | 0.6238 | 0.1955 | 0.2919 | 0.2375 | 0.0482 | 0.0783 | 0.4343 | 0.0675 | 0.1140 | 0.1011 | 0.0394 | 0.0167 | 0.0079 |
| Deepseek-V3 | 0.5452 | 0.2670 | 0.3509 | 0.2031 | 0.0793 | 0.1106 | 0.3749 | 0.1191 | 0.1748 | 0.1965 | 0.0772 | 0.0373 | 0.0196 |
| Qwen3-4B | 0.4099 | 0.2749 | 0.3203 | 0.1651 | 0.0820 | 0.1056 | 0.3299 | 0.1359 | 0.1849 | 0.2271 | 0.0806 | 0.036 | 0.0196 |
| Qwen3-0.6B | 0.3840 | 0.2722 | 0.3095 | 0.1596 | 0.0835 | 0.1054 | 0.3302 | 0.1332 | 0.1825 | 0.2191 | 0.0824 | 0.0416 | 0.0251 |
### Licensing
Please ensure compliance with the source literature’s usage rights. This dataset is released under CC BY 4.0.
### Disclaimer
This dataset is provided “as is” without any warranty of accuracy, completeness, or fitness for a particular purpose. While the questions and answers have been verified by experts to the best of our ability, users should independently verify any critical information derived from this dataset before using it in academic, industrial, or safety-critical applications. The authors and maintainers assume no responsibility or liability for any errors or omissions in the content or for any outcomes resulting from its use.
# MaterialsQA-SFT:面向材料科学大语言模型评估的问答数据集
## MaterialsQA-SFT 数据集卡片
### 数据集概述
MaterialsQA-SFT是一个合成生成的问答数据集,旨在评估大语言模型(Large Language Model, LLM)的生成准确性,尤其聚焦于以下材料科学细分领域:
- 高熵合金(High-Entropy Alloys, HEAs)
- 热障涂层(Thermal Barrier Coatings, TBCs)
- 高温氧化(High-Temperature Oxidation)
该数据集包含902条经材料科学专家验证的问答对,均从近期权威综述文章中提炼并重新表述而成。其核心目标是基准测试大语言模型在材料领域的知识抽取、推理与摘要任务中的表现。
### 来源文献
本数据集的问答对整理自以下综合综述论文:
- 高熵合金领域:[George等人,《Acta Materialia》,2020](https://doi.org/10.1016/j.actamat.2019.12.015)
- 热障涂层领域:[Mondal,《Industrial & Engineering Chemistry Research》,2021](https://pubs.acs.org/doi/10.1021/acs.iecr.1c00788?ref=pdf)
- 高温氧化领域:[Gao等人,《Progress in Materials Science》,2025](https://doi.org/10.1016/j.pmatsci.2024.101348)
抽取得到的原始问答对经改写以提升表述清晰度,同时完整保留技术准确性,随后由材料科学领域专家完成最终验证。
### 数据集结构
本数据集提供两种通用格式,以适配多样化的大语言模型训练与评估流程:
1. EvalScope格式
每个样本包含以下字段:
- query:材料科学相关问题
- response:经专家验证的标准答案
3. Alpaca格式
每个样本包含以下字段:
- messages:一条包含两条对话消息的列表
- 第一条消息:角色为"user",内容为待解答的问题
- 第二条消息:角色为"assistant",内容为对应答案
### 预期用途
- 基准测试大语言模型在材料科学领域的知识生成、抽取、摘要与推理任务性能
- 面向材料科学领域的专用大语言模型进行微调或指令微调(若用于训练,请确保遵守相关许可协议)
### 基准模型性能
本数据集用于评估多款近期大语言模型,得到以下定量评测结果(batch_size=8):
| 模型 | Rouge-1召回率 | Rouge-1精确率 | Rouge-1F1值 | Rouge-2召回率 | Rouge-2精确率 | Rouge-2F1值 | Rouge-L召回率 | Rouge-L精确率 | Rouge-LF1值 | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Deepseek-R1 | 0.6238 | 0.1955 | 0.2919 | 0.2375 | 0.0482 | 0.0783 | 0.4343 | 0.0675 | 0.1140 | 0.1011 | 0.0394 | 0.0167 | 0.0079 |
| Deepseek-V3 | 0.5452 | 0.2670 | 0.3509 | 0.2031 | 0.0793 | 0.1106 | 0.3749 | 0.1191 | 0.1748 | 0.1965 | 0.0772 | 0.0373 | 0.0196 |
| Qwen3-4B | 0.4099 | 0.2749 | 0.3203 | 0.1651 | 0.0820 | 0.1056 | 0.3299 | 0.1359 | 0.1849 | 0.2271 | 0.0806 | 0.036 | 0.0196 |
| Qwen3-0.6B | 0.3840 | 0.2722 | 0.3095 | 0.1596 | 0.0835 | 0.1054 | 0.3302 | 0.1332 | 0.1825 | 0.2191 | 0.0824 | 0.0416 | 0.0251 |
### 许可协议
请务必遵守来源文献的使用权限要求。本数据集采用CC BY 4.0协议发布。
### 免责声明
本数据集按“现状”提供,不附带任何关于准确性、完整性或特定用途适用性的保证。尽管问答对已尽最大可能由领域专家验证,用户在将本数据集衍生的关键信息用于学术、工业或安全关键型应用前,仍应独立进行验证。作者及维护者不对内容中的任何错误或遗漏,或因使用本数据集导致的任何后果承担责任或义务。
提供机构:
maas
创建时间:
2025-07-02



