MatSci-NLP

Source: arXiv. Indexed 2025-09-30.

Download link:

https://github.com/BangLab-UdeM-Mila/NLP4MatSci-HoneyBee


Overview:

MatSci-NLP is a broad benchmark for natural language processing (NLP) tasks in materials science. It draws on text from several materials science domains, including but not limited to fuel cells, inorganic materials, glasses, and superconductors. For evaluation, the data is split into a 1% training subset and a 99% test subset. The HoneyBee model is evaluated on this benchmark using macro-F1 and micro-F1 scores. The benchmark covers multiple NLP tasks and material types, and its purpose is to assess how well language models perform across materials science NLP tasks.
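The 1% / 99% split and the two F1 variants can be illustrated with a small sketch. This is not the benchmark's official evaluation code: the helper names `low_resource_split` and `f1_scores`, the random seed, and the toy labels are all hypothetical, and the sketch assumes single-label classification with a simple random split.

```python
import random
from collections import Counter


def low_resource_split(examples, train_fraction=0.01, seed=0):
    """Shuffle examples and split them into a small train set and a large test set.

    Assumption: a plain random split; the benchmark's own procedure may differ.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = max(1, round(len(items) * train_fraction))
    return items[:n_train], items[n_train:]


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: unweighted mean of per-label F1, so rare labels count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro-F1: F1 over pooled counts, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy usage with hypothetical labels.
train, test = low_resource_split(range(200))
macro_f1, micro_f1 = f1_scores(["a", "a", "b", "c"], ["a", "b", "b", "c"])
```

In the single-label setting, micro-F1 equals accuracy, while macro-F1 drops when performance on rare labels is poor. Reporting both shows whether a model does well only on the frequent classes.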

Dataset introduction

Background overview

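NLP4MatSci-HoneyBee is an instruction-finetuning dataset for large language models in materials science, built on the EMNLP 2023 paper 'HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science'. It contains instruction data for training and testing, together with the associated code, and supports the use of language models on materials science tasks.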

The content above was collected and summarized by 遇见数据集.
