BiGGen-Bench
Source: arXiv | Updated: 2024-06-09 | Indexed: 2024-06-12
Download link:
https://huggingface.co/datasets/prometheus-eval/BiGGen-Bench
Description:
BiGGen-Bench is a language model evaluation benchmark created by the Korea Advanced Institute of Science and Technology (KAIST), LG AI Research, and other institutions. It evaluates nine core capabilities of language models across 77 tasks: instruction following, grounding, planning, reasoning, refinement, safety, theory of mind, tool use, and multilingual ability. The dataset contains 765 instances, each with its own fine-grained evaluation criteria, so that responses are judged against instance-specific requirements. The instances were built with a human-in-the-loop process to control quality. The benchmark is used mainly to evaluate and improve language models, especially where fine-grained assessment is needed.
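As a quick sanity check on the description above, the instances can be tallied per capability after loading. The following is a minimal sketch, not an official loader: the repository ID is taken from the download link, but the column name `capability` is an assumption (check the dataset card for the actual schema), and `count_by_field` and `load_records` are hypothetical helper names introduced here for illustration.

```python
from collections import Counter
from typing import Iterable, Mapping

REPO_ID = "prometheus-eval/BiGGen-Bench"  # taken from the download link above


def count_by_field(records: Iterable[Mapping], field: str) -> Counter:
    """Count records per value of `field`; records lacking the field fall under None."""
    return Counter(rec.get(field) for rec in records)


def load_records(repo_id: str = REPO_ID):
    """Yield every row from every split of the dataset (needs network access)."""
    from datasets import load_dataset  # lazy import; install with: pip install datasets

    dataset_dict = load_dataset(repo_id)  # no split name is assumed here
    for split in dataset_dict.values():
        yield from split


if __name__ == "__main__":
    # "capability" is an ASSUMED column name, not confirmed by this page.
    counts = count_by_field(load_records(), "capability")
    for capability, n in counts.most_common():
        print(capability, n)
```

If the assumed column exists, the counts should cover nine capabilities and sum to the 765 instances stated in the description; a single `None` bucket would indicate that the column name differs.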
Provider:
Korea Advanced Institute of Science and Technology (KAIST)
Created:
2024-06-09



