VAGOsolutions/MT-Bench-TrueGerman

Name: VAGOsolutions/MT-Bench-TrueGerman
Creator: VAGOsolutions
Published: 2023-10-12 10:07:55
License: no description available

Source: Hugging Face. Updated 2023-10-12. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/VAGOsolutions/MT-Bench-TrueGerman


Resource description:

---
language:
- de
---

## Benchmark

**German Benchmarks on Hugging Face**

At present, there is a notable scarcity, if not a complete **absence, of reliable and true German benchmarks** designed to evaluate the capabilities of German large language models (LLMs). While some efforts have been made to translate English benchmarks into German, these attempts often **fall short in terms of precision, accuracy, and context sensitivity, even when employing GPT-4**.

Take, for instance, **MT-Bench**, a widely recognized and frequently used benchmark for assessing LLM performance in real-world scenarios. The seemingly straightforward and cost-effective approach of **translating MT-Bench into German using GPT-4 proves to be counterproductive**, resulting in subpar outcomes that hinder a realistic and contextually appropriate evaluation of German LLMs. To illustrate this, we offer a few examples extracted from translated MT-Bench versions available on Hugging Face.

**Example: Uncommon use of words**

*{ "category": "writing", "turns": [ "Schreibe eine überzeugende E-Mail, um deinen introvertierten Freund, der öffentliches Sprechen nicht mag, dazu zu bringen, sich als Gastredner bei einer lokalen Veranstaltung zu engagieren. Verwende überzeugende Argumente und gehe auf mögliche Einwände ein. Bitte sei prägnant.", "Kannst du deine vorherige Antwort umformulieren und in jedem Satz eine Metapher oder ein **Gleichnis** einbauen?" ] }*

The word marked in bold is a German word that nobody would use in a real conversation. A speaker would rather say “Vergleich” than “Gleichnis”.

**Example: Wrong context**

*{ "category": "roleplay", "turns": [ "Bitte nehmen Sie die Rolle eines englischen Übersetzers an, der damit beauftragt ist, Rechtschreibung und Sprache zu korrigieren und zu verbessern. Unabhängig von der Sprache, die ich verwende, sollten Sie sie identifizieren, übersetzen und mit einer verfeinerten und polierten Version meines Textes **auf Englisch antworten**.*

Here the model is asked to translate a given sentence into English and to answer with a more sophisticated version of the original sentence. Since we aim to assess a German LLM, asking the model to answer in English is pointless.

**Example: Wrong content**

*{"category": "writing", "turns": [ "Bearbeite den folgenden Absatz, um etwaige grammatikalische Fehler zu korrigieren: ***Sie erinnerte sich nicht daran, wo ihre Geldbörse ist, also denke ich, dass sie im Auto ist, aber er sagt, dass sie auf dem Küchentisch ist, aber er ist sich nicht sicher, und dann haben sie mich gebeten, danach zu suchen, sie sagt: "Kannst du?", und ich antworte: "Vielleicht, aber ich bin nicht sicher", und er hat mich nicht gehört, und er fragt: "Was?", "Hast du es gefunden?"***.", "Ändere deine frühere Antwort und vermeide die Verwendung von geschlechtsspezifischen Pronomen." ]}*

The task here is to correct a paragraph full of grammatical errors. The problem with this translated version of MT-Bench is that GPT-4 already corrected the paragraph during translation, so the model is asked to correct text that no longer contains any grammatical errors.

**Example: Pointless translation of anglicisms**

*{ "category": "roleplay", "turns": [ "Jetzt bist du ein **Maschinenlern-Ingenieur**. Deine Aufgabe besteht darin, komplexe Maschinenlernkonzepte auf einfache Weise zu erklären, damit Kunden ohne technischen Hintergrund deine Produkte verstehen und ihnen vertrauen können. Fangen wir an mit der Frage: Was ist ein Sprachmodell? Wird es mit gelabelten oder ungelabelten Daten trainiert?, "Ist das wahr? Ich habe gehört, dass andere Unternehmen unterschiedliche Ansätze verwenden, um dies zu tun und es sicherer zu machen.]}*

As we can see here, the GPT-4 translation of this dataset led to a term that no one would use when speaking German. A speaker would rather use the original English term “Machine Learning Engineer” or the properly translated term “Ingenieur für maschinelles Lernen”.

**Our approach to a German Benchmark**

Instead of simply translating MT-Bench with GPT-4, we applied a mixed approach of automatic translation and human evaluation. In a first step, we translated the complete MT-Bench into German using GPT-4. In a second step, we conducted a thorough manual evaluation of each translated dataset to ensure the following quality criteria:

- The dataset has been translated into German.
- The German translation uses appropriate and genuine wording.
- The context of the translated dataset is meaningful and reasonable for assessing the German language skills of the model.
- The content of the translated dataset is still reasonable after translation.

Although this method is undeniably time-consuming, it enables us to create a substantive benchmark for evaluating the model's proficiency across the various benchmark categories. Nonetheless, it is important to acknowledge that even with this meticulous approach, a truly flawless benchmark remains elusive, as minor oversights may still occur due to human error. Nevertheless, compared with the current approaches of German language model teams on Hugging Face, we may assume that our German MT-Bench, as of today, stands as the most precise and practical benchmark for assessing German LLMs. Consequently, the benchmark scores we present offer a realistic evaluation of the models' performance in German.
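The four quality criteria of the manual evaluation could be recorded per item in a small structure like the one below. This is only a minimal sketch: the class name `ReviewResult`, its field names, and the `note` field are hypothetical and not part of the dataset or of the authors' tooling.

```python
from dataclasses import dataclass


@dataclass
class ReviewResult:
    """Outcome of manually reviewing one translated item (hypothetical schema)."""
    is_german: bool            # the item has been translated into German
    genuine_wording: bool      # wording is appropriate and genuine
    meaningful_context: bool   # context is suitable for assessing German skills
    reasonable_content: bool   # content is still reasonable after translation
    note: str = ""             # free-text remark from the reviewer

    def failed_criteria(self) -> list[str]:
        """Return the names of the criteria this item does not meet."""
        checks = {
            "is_german": self.is_german,
            "genuine_wording": self.genuine_wording,
            "meaningful_context": self.meaningful_context,
            "reasonable_content": self.reasonable_content,
        }
        return [name for name, ok in checks.items() if not ok]

    def accepted(self) -> bool:
        """An item is accepted only if all four criteria hold."""
        return not self.failed_criteria()


# The grammar-correction item whose errors were already fixed during
# translation would fail the "reasonable content" criterion.
review = ReviewResult(True, True, True, False,
                      note="errors already corrected by the translation")
print(review.accepted(), review.failed_criteria())
```

Keeping the criteria as separate booleans, rather than a single pass/fail flag, makes it possible to see which kind of problem dominates across the benchmark.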

This dataset aims to provide a precise and practical benchmark for German large language models (LLMs) by combining automatic translation with human evaluation, ensuring that the translated items have appropriate wording, meaningful context, and reasonable content in German.

Provider:

VAGOsolutions

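Summary of original information

Dataset overview

Dataset background

Reliable and genuine German benchmarks for German large language models (LLMs) are currently very scarce, if not entirely absent. Some attempts have been made to translate English benchmarks into German, but they often fall short in precision, accuracy, and context sensitivity, even when GPT-4 is used.

Known problems

Inappropriate wording: for example, German words that would not be used in a real conversation (such as “Gleichnis”).
Wrong context: the model is asked to translate a sentence into English, which is pointless when evaluating a German LLM.
Wrong content: the sentence was already corrected by GPT-4 during translation, so the model is asked to correct a sentence that has no grammatical errors.
Pointless translation: the GPT-4 translation produced terms that are not used in German (such as “Maschinenlern-Ingenieur”).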
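A simple pattern search can help prioritize items for manual review by flagging the kinds of problems listed above. The sketch below is an illustration only: the patterns are taken from the examples in the dataset card, while the labels, the `flag_item` helper, and the idea of pre-screening at all are assumptions rather than the authors' documented method. It cannot replace the human evaluation.

```python
import re

# Patterns taken from the examples in the card; the labels and the list
# itself are illustrative assumptions, not an exhaustive rule set.
SUSPICIOUS_PATTERNS = {
    "uncommon wording": re.compile(r"\bGleichnis\b"),
    "translated anglicism": re.compile(r"\bMaschinenlern-Ingenieur\b"),
    "asks for English output": re.compile(r"\bauf Englisch antworten\b"),
}


def flag_item(item: dict) -> list[str]:
    """Return the labels of all patterns found in any turn of an item."""
    text = " ".join(item.get("turns", []))
    return [label for label, pattern in SUSPICIOUS_PATTERNS.items()
            if pattern.search(text)]


item = {
    "category": "roleplay",
    "turns": ["Jetzt bist du ein Maschinenlern-Ingenieur. Was ist ein Sprachmodell?"],
}
print(flag_item(item))
```

The "wrong content" problem (errors silently corrected during translation) has no textual signature of this kind, so it would still need a side-by-side comparison with the English original.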

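Dataset construction method

A mixed approach of automatic translation and human evaluation:

The complete MT-Bench is translated into German using GPT-4.
Each translated dataset is evaluated thoroughly by hand to ensure the following quality criteria:
- The dataset has been translated into German.
- The German translation uses appropriate and genuine wording.
- The context of the translated dataset is meaningful and reasonable, and suitable for assessing the model's German language skills.
- The content of the translated dataset is still reasonable after translation.

Dataset advantages

Although this approach is very time-consuming, it enables us to create a substantive benchmark for evaluating a model's proficiency across the various benchmark categories. It cannot fully rule out human error, but compared with the approaches of other German language model teams on Hugging Face, our German MT-Bench is currently the most precise and practical benchmark for evaluating the performance of German LLMs.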

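Collected summary

Dataset introduction

Background and challenges

Background overview

This is a benchmark dataset for German language models that aims to provide a high-quality German evaluation benchmark by combining automatic translation with human evaluation. However, the dataset has a column mismatch error that causes dataset generation to fail, which affects its completeness and usability.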
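The summary does not say which columns mismatch. If the data is stored as JSON Lines with `category` and `turns` fields as in the card's examples, one way to work around a strict loader is to parse the records individually and inspect which field sets occur. The following is a sketch under that assumption; the `question_id` and `reference` fields in the sample are hypothetical and only illustrate records with differing columns.

```python
import json
from collections import Counter


def read_jsonl(lines):
    """Parse JSON Lines records one by one, without enforcing a shared column set."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


def column_sets(records):
    """Count how many records use each distinct set of field names."""
    return Counter(tuple(sorted(record)) for record in records)


# Hypothetical sample: the second record carries an extra field.
sample = [
    '{"question_id": 1, "category": "writing", "turns": ["a", "b"]}',
    '{"question_id": 2, "category": "math", "turns": ["c", "d"], "reference": ["e", "f"]}',
]
records = read_jsonl(sample)
for columns, count in column_sets(records).items():
    print(count, columns)
```

With a real file, the same functions can be fed an open file handle, since iterating over a text file yields its lines.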

The content above was collected and summarized by 遇见数据集.
