kifai/KoInFoBench

Name: kifai/KoInFoBench
Creator: kifai
Published: 2024-05-18 10:16:59
License: MIT

Source: Hugging Face. Updated: 2024-05-18. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/kifai/KoInFoBench

Resource description:

---
license: mit
task_categories:
- text-generation
language:
- ko
size_categories:
- n<1K
---

# KoInFoBench

KoInFoBench is a specialized evaluation dataset designed to assess how well Large Language Models (LLMs) follow Korean instructions.<br>
The current version of `KoInFoBench` consists of 60 instruction sets and 233 questions.

Inspired by the [InFoBench](https://huggingface.co/datasets/kqsong/InFoBench) dataset, we extend its concept by focusing on the nuances and features of the Korean language.

- 🖥️ Code to reproduce the results or evaluate your own LLMs is available at [https://github.com/KIFAI/KoInFoBench](https://github.com/KIFAI/KoInFoBench)
- 📄 A paper is in preparation and will be released soon!

### 🚀 Update

- **2024.05.18**: added results for `gpt-4o-2024-05-13`, `claude-3-sonnet-20240229`, `solar-1-mini-chat`

## Dataset Overview

### Usage

```python
from datasets import load_dataset

dataset = load_dataset('kifai/KoInFoBench')
```

### Example

```json
{
  "id": "19",
  "subset": "input_intensive_set",
  "category": "구글캘린더",
  "instruction": "다음은 해외 콘서트 참가 확정에 대한 영문으로 작성된 이메일입니다. 한국시간(KST) 기준으로 참가 확정된 날짜, 콘서트 날짜와 시간을 \"년-월-일 시간\" 형식으로 작성하고 한국시간 기준으로 참가 확정일로부터 콘서트 날짜까지 몇 일 남았는지 계산하여 국문으로 정답을 함께 작성합니다.",
  "input": "Email: We are pleased to inform you that your concert ticket purchase has been successfully confirmed at approximately 11am GMT today (26 March 2024). The concert you have been eagerly awaiting is scheduled to take place on 17 September 2024, starting at 6 PM UTC+2. Please mark your calendar and prepare to join us for an unforgettable evening of live music and entertainment. Your ticket grants you access to a night filled with exceptional performances, engaging visuals, and the vibrant energy of live music. \nWe recommend arriving early to enjoy the full experience, including pre-concert activities and amenities.",
  "decomposed_questions": [
    "답변은 해외 콘서트 참가 일정에 대한 내용이 포함되어 있습니까?",
    "답변으로 작성된 모든 일정은 한국시간(KST) 기준으로 작성되었습니까?",
    "콘서트 참가가 확정된 날짜 그리고 콘서트 날짜와 시간 2개의 일정을 모두 포함합니까?",
    "날짜와 시간이 \"년-월-일 시간\" 형식으로 올바르게 작성되었습니까?",
    "콘서트 확정일로부터 콘서트까지 남은 기간은 콘서트 시작일을 포함할 경우 177일, 미포함인 경우 176일입니다. 남은 기간을 176일 혹은 177일로 계산하였습니까?"
  ],
  "question_label": [
    "Format",
    "Format, Content",
    "Format",
    "Format",
    "Number"
  ],
  "ref": ""
}
```

### Fields

- **id**: unique identifier for each entry in the dataset
- **subset**: either `input_intensive_set` or `instruction_intensive_set`, where "intensive" indicates whether the entry focuses on evaluating Korean-specific input or detailed instruction following
- **category**: a string naming the category the entry belongs to. For example, '구글캘린더' indicates that the entry is related to tasks associated with Google Calendar
- **instruction**: a string containing the instruction
- **input**: a string containing context information; it can be empty
- **decomposed_questions**: a list of string questions that decompose the task related to the entry. Each question is designed to evaluate the LLM's response
- **question_label**: a list of string labels that identify the type of each decomposed question. Each label can cover multiple aspects, such as Format, Content, Number, Linguistic, Style
- **ref**: a string holding references or additional information; it can be empty

## Evaluation Result

### DRFR

Decomposed Requirements Following Ratio (DRFR) is the metric used to evaluate how accurately LLMs respond to the instruction/input. This metric calculates the average accuracy across answers to the decomposed questions for each instruction. The following table summarizes model performance on our dataset.

| Model | H_DRFR | A_DRFR | Alignment |
|-------|--------|--------|-----------|
| **claude-3-opus-20240229** | **0.854** | 0.850 | 87% |
| **gpt-4-turbo-2024-04-09** | 0.850 | 0.880 | 87% |
| **gpt-4o-2024-05-13** | 0.850 | 0.863 | 89% |
| **gpt-4-0125-preview** | 0.824 | 0.824 | 83% |
| **claude-3-sonnet-20240229** | 0.790 | 0.828 | 84% |
| **gemini-1.5-pro** | 0.773 | 0.811 | 83% |
| **meta-llama/Meta-Llama-3-70B-Instruct** | 0.747 | 0.863 | 84% |
| **hpx003** | 0.691 | 0.738 | 83% |
| **gpt-3.5-turbo-0125** | 0.678 | 0.734 | 82% |
| **solar-1-mini-chat** | 0.614 | 0.695 | 79% |
| **yanolja/EEVE-Korean-Instruct-10.8B-v1.0** | 0.597 | 0.730 | 79% |

- `H_DRFR`: the accuracy of model responses as evaluated by a human expert
- `A_DRFR`: the accuracy of model responses as evaluated automatically by GPT-4, used as an LLM-as-a-judge
- `Alignment`: the degree of agreement between the human and automated evaluations

> Please note that the evaluation results of the LLMs presented in the above table may vary due to the randomness of LLM outputs.

## Additional Information

### License Information

This dataset is released under the [MIT License](https://github.com/KIFAI/KoInfoBench/blob/main/LICENSE).

### Citation Information

```
@article{,
  title={KoInFoBench},
  author={Sungwoo Oh, Sungjun Kown, Donggyu Kim},
  year={2024},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
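The last decomposed question of the example entry (id 19) in the dataset card above depends on a time-zone conversion and a day count. As a quick check of that arithmetic, here is a minimal sketch using only the Python standard library. The fixed UTC offsets (GMT as UTC+0, KST as UTC+9) and all variable names are assumptions made for illustration; none of this is taken from the dataset's evaluation code.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets assumed for illustration: GMT = UTC+0, venue = UTC+2, KST = UTC+9.
GMT = timezone.utc
VENUE = timezone(timedelta(hours=2))
KST = timezone(timedelta(hours=9))

# Times as stated in the example email.
confirmed = datetime(2024, 3, 26, 11, 0, tzinfo=GMT)
concert = datetime(2024, 9, 17, 18, 0, tzinfo=VENUE)

# Convert both moments to Korean time.
confirmed_kst = confirmed.astimezone(KST)
concert_kst = concert.astimezone(KST)

# Count calendar days between the two KST dates.
days_exclusive = (concert_kst.date() - confirmed_kst.date()).days
days_inclusive = days_exclusive + 1  # also counts the concert day itself

print(confirmed_kst.strftime("%Y-%m-%d %H:%M"))  # 2024-03-26 20:00
print(concert_kst.strftime("%Y-%m-%d %H:%M"))    # 2024-09-18 01:00
print(days_exclusive, days_inclusive)            # 176 177
```

Note that the concert falls on 18 September in KST, one day later than the date in the email, which is why both dates must be converted before counting.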

Provider:

kifai

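Original information summary

Dataset overview

Dataset name

Name: KoInFoBench
Description: Designed specifically to evaluate how well large language models (LLMs) follow Korean instructions.

Dataset contents

Version: Contains 60 instruction sets and 233 questions.
Language: Korean (ko)
Task category: Text generation (text-generation)
Size category: Fewer than 1K (n<1K)

Dataset structure

Fields:
- id: Unique identifier for each entry in the dataset.
- subset: Either input_intensive_set or instruction_intensive_set, indicating whether the evaluation focuses on Korean-specific input or on detailed instruction following, respectively.
- category: The category the entry belongs to; for example, 구글캘린더 indicates tasks related to Google Calendar.
- instruction: A string containing the instruction.
- input: A string containing context information; may be empty.
- decomposed_questions: A list of string questions that decompose the task, used to evaluate the LLM's response.
- question_label: A list of string labels identifying the type of each decomposed question.
- ref: A string holding references or additional information; may be empty.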
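To make the relationship between decomposed_questions and question_label concrete, the following minimal sketch pairs each question with its aspect labels. It assumes the two lists are aligned by index and that a label such as "Format, Content" is a comma-separated list of aspects, which is what the example entry suggests but is not stated explicitly. The record is an abridged, hand-written copy of that example, and `label_pairs` is a hypothetical helper, not part of the dataset's tooling.

```python
from collections import Counter

# Abridged copy of the example entry (id 19); the Korean questions are replaced by placeholders.
record = {
    "id": "19",
    "subset": "input_intensive_set",
    "decomposed_questions": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "question_label": ["Format", "Format, Content", "Format", "Format", "Number"],
}

def label_pairs(entry):
    """Pair each decomposed question with its aspect labels (assumes index alignment)."""
    questions = entry["decomposed_questions"]
    labels = entry["question_label"]
    if len(questions) != len(labels):
        raise ValueError("questions and labels differ in length")
    # Split multi-aspect labels such as "Format, Content" into separate aspects.
    return [(q, [part.strip() for part in lab.split(",")]) for q, lab in zip(questions, labels)]

pairs = label_pairs(record)
aspect_counts = Counter(aspect for _, aspects in pairs for aspect in aspects)

print(pairs[1])             # ('Q2', ['Format', 'Content'])
print(dict(aspect_counts))  # {'Format': 4, 'Content': 1, 'Number': 1}
```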

Usage example

from datasets import load_dataset

dataset = load_dataset('kifai/KoInFoBench')

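Evaluation results

Evaluation metric: Decomposed Requirements Following Ratio (DRFR), which measures how accurately LLMs respond to the instruction/input.
Model performance summary: DRFR results are reported for multiple models, covering human-evaluated accuracy, automatically evaluated accuracy, and the agreement between the two.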
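As a rough illustration of how such scores could be computed, the sketch below takes yes/no judgments for each decomposed question, grouped by instruction. The dataset card's wording leaves open whether the average is pooled over all questions or taken per instruction first, so both variants are shown. The agreement function is likewise an assumed reading of "Alignment" (the share of questions on which human and automatic judgments match). All function names and the sample judgments are hypothetical.

```python
def drfr_pooled(judgments):
    """Fraction of decomposed questions judged satisfied, pooled over all instructions."""
    flat = [ok for per_instruction in judgments for ok in per_instruction]
    return sum(flat) / len(flat)

def drfr_per_instruction_mean(judgments):
    """Mean of the per-instruction satisfaction ratios."""
    ratios = [sum(per) / len(per) for per in judgments]
    return sum(ratios) / len(ratios)

def agreement(human, auto):
    """Share of questions on which the human and automatic judgments coincide."""
    pairs = [(h, a) for hs, au in zip(human, auto) for h, a in zip(hs, au)]
    return sum(h == a for h, a in pairs) / len(pairs)

# Made-up judgments for two instructions (4 and 2 decomposed questions).
human = [[True, True, False, True], [True, False]]
auto = [[True, True, True, True], [True, False]]

print(round(drfr_pooled(human), 3))                # 0.667
print(round(drfr_per_instruction_mean(human), 3))  # 0.625
print(round(drfr_pooled(auto), 3))                 # 0.833
print(round(agreement(human, auto), 3))            # 0.833
```

The two averaging variants differ whenever instructions have different numbers of decomposed questions, so it is worth checking the repository's evaluation code before comparing numbers against the reported table.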

License information

License: MIT

Citation information

@article{, title={KoInFoBench}, author={Sungwoo Oh, Sungjun Kown, Donggyu Kim}, year={2024}, eprint={}, archivePrefix={arXiv}, primaryClass={cs.CL} }
