bofenghuang/mt-bench-french

Name: bofenghuang/mt-bench-french
Creator: bofenghuang
Published: 2024-07-20 19:57:18
License: Apache-2.0

Hugging Face | Updated 2024-07-20 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/bofenghuang/mt-bench-french

Overview:

MT-Bench-French is a French version of the MT-Bench dataset, designed to evaluate the multi-turn conversation and instruction-following capabilities of large language models (LLMs) in French. The dataset comprises 80 high-quality multi-turn questions spanning eight main categories. All questions were translated into French and thoroughly reviewed by humans to ensure appropriate and authentic wording, content that is meaningful for assessing LLM capabilities, and coherence between questions within the same conversation. For certain challenging tasks (e.g., math, reasoning, and coding), a reference answer is included in the judge prompt to assist in evaluating LLM responses. These reference answers were generated by the LLM judge (GPT-4) and then reviewed and corrected by humans to mitigate bias in evaluation. Although this dataset provides a convenient way to evaluate LLMs, it should not be regarded as the ultimate benchmark for such assessments, given the inherent limitations of both the dataset and the methodology.

Provider:

bofenghuang

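Summary of original information

Dataset overview

Dataset name

MT-Bench-French

Dataset description

MT-Bench-French is a French-language evaluation dataset for assessing the multi-turn conversation and instruction-following capabilities of large language models (LLMs). It contains 80 high-quality multi-turn questions covering 8 main categories.

Dataset features

All questions have been translated into French and thoroughly reviewed by humans to ensure appropriate and authentic wording, as well as content that is meaningful for assessing LLM capabilities in French.
For certain challenging tasks (such as math, reasoning, and coding), reference answers generated by an LLM (GPT-4) are included; these were additionally reviewed and corrected by humans.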
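
To illustrate the role of the reference answers, here is a minimal sketch of how a reference might be placed in a judge prompt for the tasks that have one. The function name `build_judge_prompt` and the bracketed section labels are hypothetical and chosen only for illustration; they are not the template actually used with this dataset.

```python
def build_judge_prompt(question, answer, reference=None):
    """Assemble a simple judge prompt.

    The layout below is a hypothetical template, not the one used by
    MT-Bench-French. The reference section is added only when a
    reference answer is available (e.g. math, reasoning, coding).
    """
    parts = ["[Question]", question]
    if reference is not None:
        parts += ["[Reference answer]", reference]
    parts += ["[Assistant answer]", answer]
    return "\n".join(parts)


# Example with made-up strings: one prompt with a reference, one without.
with_reference = build_judge_prompt("Combien font 2 + 2 ?", "4", reference="4")
without_reference = build_judge_prompt("Écris un court poème.", "Un poème...")
```

Keeping the reference optional lets the same function serve both the categories that ship a reference answer and those that do not.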

Dataset structure

The dataset has a single default configuration, whose test data file is question.jsonl.
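
As a rough sketch of how question.jsonl could be inspected once it has been downloaded locally, the snippet below reads a JSON Lines file (one JSON object per line) and tallies entries per category. The field name `category` is an assumption borrowed from the original MT-Bench question format; this page does not document the schema, so check the actual file before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def load_questions(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    questions = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                questions.append(json.loads(line))
    return questions


def count_by_category(questions):
    """Count questions per category.

    "category" is an assumed field name; entries without it are
    grouped under "unknown".
    """
    return Counter(q.get("category", "unknown") for q in questions)
```

A call such as `load_questions("question.jsonl")` assumes the file sits in the working directory. If the file matches the description above, the result should hold 80 questions spread over eight categories.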

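Usage notes

Although this dataset provides a convenient way to evaluate LLMs, it should not be regarded as the ultimate benchmark for such evaluation, given the inherent limitations of both the dataset and the methodology.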

Dataset language

French (fr)

Dataset license

Apache-2.0

Dataset task category

Question answering (question-answering)

Dataset tags

Evaluation (evaluation)

Dataset size category

Fewer than 1K (n<1K)
