LabSafety Bench

Name: LabSafety Bench
Creator: 圣母大学
Published: 2024-10-18 13:21:05
License: 暂无描述

arXiv2024-10-18 更新2024-10-22 收录

下载链接：

https://yujunzhou.github.io/LabSafetyBench.github.io/

下载链接

链接失效反馈

官方服务：

资源简介：

LabSafety Bench是由圣母大学和IBM研究团队创建的一个实验室安全评估框架，旨在评估大型语言模型（LLMs）在实验室安全环境中的可靠性。数据集包含765个多选题，涵盖了实验室安全的四个主要领域：危险物质、应急响应、责任与合规、设备与材料处理。这些问题由人类专家验证，确保其准确性和清晰性。数据集的创建过程包括提出新的实验室安全分类法、收集相关材料、生成问题并由GPT-4o协助优化选项，最后由专家审核。该数据集主要用于评估LLMs在实验室安全决策中的应用，旨在提高实验室环境中的安全性和可靠性。

LabSafety Bench is a laboratory safety evaluation framework developed by teams from the University of Notre Dame and IBM Research, which aims to assess the reliability of Large Language Models (LLMs) in laboratory safety contexts. The dataset contains 765 multiple-choice questions spanning four core domains of laboratory safety: hazardous substances, emergency response, responsibility and compliance, and equipment and material handling. All questions have been verified by human experts to ensure their accuracy and clarity. The dataset creation process includes proposing a novel laboratory safety taxonomy, collecting relevant materials, generating questions, optimizing the options with the assistance of GPT-4o, and finally conducting expert reviews. This dataset is primarily used to evaluate the application of LLMs in laboratory safety decision-making, with the goal of enhancing safety and reliability in laboratory environments.

提供机构：

圣母大学

创建时间：

2024-10-18

搜集汇总

数据集介绍

构建方式

LabSafety Bench 数据集的构建基于与美国职业安全与健康管理局（OSHA）协议对齐的新分类法。该分类法将实验室安全问题分为四个主要类别：危险物质、应急响应、责任与合规以及设备与材料处理。研究团队根据这一分类法，精心设计了 765 道多选题，其中 632 道为纯文本问题，133 道为图文结合问题。每道题目均由人类专家验证，确保其准确性和清晰性。

特点

LabSafety Bench 数据集的特点在于其全面性和专业性。数据集涵盖了实验室安全领域的多个方面，包括但不限于生物危害、化学危害、辐射危害和物理危害。此外，数据集中的问题难度分为“简单”和“困难”两类，确保了评估的全面性。数据集还提供了详细的推理步骤，帮助用户理解每个问题的答案。

使用方法

LabSafety Bench 数据集主要用于评估大型语言模型（LLMs）和视觉语言模型（VLMs）在实验室安全环境中的表现。用户可以通过该数据集测试模型在处理实验室安全问题时的准确性和可靠性。数据集的评估方法包括链式思维（CoT）和外部提示的使用，以及零样本和五样本学习设置。此外，数据集还提供了人类评估者的表现数据，以便进行模型性能的对比分析。

背景与挑战

背景概述

LabSafety Bench is a pioneering benchmark designed to evaluate the reliability of Large Language Models (LLMs) in addressing safety issues within scientific laboratories. Developed by researchers from the University of Notre Dame and IBM Research, this benchmark was introduced to fill a critical gap in the assessment of LLMs' trustworthiness, particularly in safety-critical environments. The core research question revolves around the ability of LLMs to provide safe and accurate guidance in laboratory settings, where human lives and property are at risk due to potential laboratory accidents. LabSafety Bench comprises 765 multiple-choice questions, verified by human experts, and is aligned with Occupational Safety and Health Administration (OSHA) protocols. The benchmark's creation underscores the growing reliance on LLMs in various fields, including laboratory settings, and the need for specialized evaluations to ensure their reliability in real-world safety applications.

当前挑战

The primary challenge associated with LabSafety Bench is the lack of adequate evaluation data specifically tailored for assessing LLMs' trustworthiness in laboratory safety contexts. While extensive corpora of lab safety protocols exist, these are mostly statements and are likely part of the training data for LLMs. Existing benchmarks in general scientific domains primarily evaluate LLMs' reasoning abilities and domain knowledge understanding but fail to account for the practical challenges LLMs might face when handling lab-related safety issues in the physical world. For instance, a new lab technician who is unfamiliar with lab safety might ask an LLM for guidance and follow the seemingly correct steps. However, if the LLM's response omits a critical safety procedure, it could lead to serious consequences. Similarly, in a self-driving lab, an LLM-generated instruction might lack a safety-critical step, resulting in equipment malfunction, chemical spills, or even explosions. Addressing these challenges requires a specialized evaluation framework that accurately assesses the trustworthiness of LLMs in real-world safety applications.

常用场景

经典使用场景

LabSafety Bench 数据集的经典使用场景在于评估大型语言模型（LLMs）和视觉语言模型（VLMs）在实验室安全环境中的表现。通过包含765个多选题的全面评估框架，该数据集能够系统地测试模型在实验室安全上下文中的可靠性。这些测试涵盖了从危险物质处理到紧急响应等多个方面，确保模型在面对实验室安全问题时能够提供准确且安全的指导。

解决学术问题

LabSafety Bench 数据集解决了当前学术研究中一个重要的空白，即缺乏对LLMs在实验室安全应用中可靠性的系统评估。通过提供一个与职业安全与健康管理局（OSHA）协议对齐的新分类法和专家验证的问题，该数据集能够准确评估模型在实际安全应用中的可信度。这不仅有助于识别模型在安全关键环境中的潜在风险，还强调了开发专门基准以准确评估LLMs在现实世界安全应用中的必要性。

衍生相关工作

LabSafety Bench 数据集的引入催生了一系列相关工作，特别是在LLMs和VLMs在实验室安全领域的应用研究。例如，一些研究已经开始探索如何通过知识蒸馏和检索增强生成（RAG）方法来进一步提升开源模型的能力。此外，该数据集还促进了在宪法AI框架下对模型进行微调，以增强其在实验室安全知识上的表现。这些衍生工作不仅扩展了LLMs的应用范围，还提升了其在高风险环境中的安全性。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集