lavita/imapScore

Name: lavita/imapScore
Creator: lavita
Published: 2024-08-15 05:08:29
License: No description available

Hugging Face, updated 2024-08-15, indexed 2025-04-08

Download link:

https://hf-mirror.com/datasets/lavita/imapScore


Description:

---
dataset_info:
  features:
  - name: Question
    dtype: string
  - name: Doctor_Response
    dtype: string
  - name: llm_prediction
    dtype: string
  - name: Average_accuracy_score
    dtype: float64
  - name: Average_completeness_score
    dtype: float64
  - name: Average_specific_scores
    dtype: float64
  splits:
  - name: HMedQA_GPT_4
    num_bytes: 232662
    num_examples: 200
  - name: HMedQA_PaLM_2
    num_bytes: 245825
    num_examples: 199
  - name: iCliniq_ChatGPT
    num_bytes: 295077
    num_examples: 200
  - name: iCliniq_PaLM_2
    num_bytes: 400676
    num_examples: 200
  download_size: 663724
  dataset_size: 1174240
task_categories:
- question-answering
language:
- en
- zh
tags:
- medical
size_categories:
- 1K<n<10K
---

# Dataset Card for "imapScore"

This dataset is the converted version of [imapScore](https://github.com/HathyHuimin/imapScore) benchmark.

Some notes about the data:

* To unify the benchmark scheme, the following columns of the subset datasets in the original benchmark are renamed to `llm_prediction`:
  * `palm2_prediction` in the `HMedQA-PaLM-2` and `iCliniq-PaLM-2` datasets
  * `gpt4_prediction` in the `HMedQA-GPT-4` dataset
  * `chatgpt_prediction` in the `iCliniq-ChatGPT` dataset

## Reference

If you use imapScore, please cite the original paper:

```
@inproceedings{wang-etal-2024-imapscore,
    title = "imap{S}core: Medical Fact Evaluation Made Easy",
    author = "Wang, Huimin and Zhao, Yutian and Wu, Xian and Zheng, Yefeng",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.610",
    pages = "10242--10257",
    abstract = "Automatic evaluation of natural language generation (NLG) tasks has gained extensive research interests, since it can rapidly assess the performance of large language models (LLMs). However, automatic NLG evaluation struggles with medical QA because it fails to focus on the crucial correctness of medical facts throughout the generated text. To address this, this paper introduces a new data structure, \textit{imap}, designed to capture key information in questions and answers, enabling evaluators to focus on essential details. The \textit{imap} comprises three components: Query, Constraint, and Inform, each of which is in the form of term-value pairs to represent medical facts in a structural manner. We then introduce \textit{imap}Score, which compares the corresponding medical term-value pairs in the \textit{imap} to score generated texts. We utilize GPT-4 to extract \textit{imap} from questions, human-annotated answers, and generated responses. To mitigate the diversity in medical terminology for fair term-value pairs comparison, we use a medical knowledge graph to assist GPT-4 in determining matches. To compare \textit{imap}Score with existing NLG metrics, we establish a new benchmark dataset. The experimental results show that \textit{imap}Score consistently outperforms state-of-the-art metrics, demonstrating an average improvement of 79.8{\%} in correlation with human scores. Furthermore, incorporating \textit{imap} into n-gram, embedding, and LLM metrics boosts the base versions, increasing correlation with human scores by averages of 89.9{\%}, 81.7{\%}, and 32.6{\%}, respectively.",
}
```
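
The column renaming described in the data notes can be illustrated with a minimal sketch. It is not the conversion script used to build this dataset. The helper `unify_prediction_column` and the mapping `PREDICTION_COLUMN` are hypothetical names introduced here, and the sketch assumes each subset is available as a plain list of row dictionaries keyed by the split names from the metadata.

```python
from typing import Any, Dict, List

# Original prediction column for each split, following the notes in the
# dataset card. The dictionary itself is an illustrative assumption.
PREDICTION_COLUMN: Dict[str, str] = {
    "HMedQA_GPT_4": "gpt4_prediction",
    "HMedQA_PaLM_2": "palm2_prediction",
    "iCliniq_ChatGPT": "chatgpt_prediction",
    "iCliniq_PaLM_2": "palm2_prediction",
}

UNIFIED_COLUMN = "llm_prediction"


def unify_prediction_column(
    split_name: str, rows: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Return copies of `rows` with the model-specific prediction column
    renamed to `llm_prediction`. The input rows are left unchanged."""
    old_name = PREDICTION_COLUMN[split_name]
    unified = []
    for row in rows:
        # Rebuild each row, swapping only the key that matches old_name.
        unified.append(
            {(UNIFIED_COLUMN if key == old_name else key): value
             for key, value in row.items()}
        )
    return unified


example_rows = [
    {"Question": "q1", "Doctor_Response": "d1", "gpt4_prediction": "p1"},
]
unified_rows = unify_prediction_column("HMedQA_GPT_4", example_rows)
print(sorted(unified_rows[0].keys()))
```

If the subsets are instead held as Hugging Face `datasets.Dataset` objects, the `rename_column` method performs the same renaming in one call.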

Provider: lavita
