Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI)

Name: Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI)
Creator: Upstage AI
Published: 2024-10-07 15:14:37
License: No description available

arXiv | Updated 2024-10-07 | Indexed 2024-10-10

Download link:

https://github.com/UpstageAI/ThaiCLI_H6

Resource overview:

Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI) are two benchmark datasets co-created by Upstage AI and Chulalongkorn University to advance research on Thai large language models (LLMs). Thai-H6 is a localized adaptation of six internationally recognized benchmarks that evaluate core LLM capabilities such as reasoning, knowledge, and commonsense. ThaiCLI instead assesses how well LLMs understand Thai social and cultural norms. Both datasets went through multiple rounds of human review and translation adjustment to ensure linguistic and cultural accuracy. They are intended mainly to improve the cultural understanding and core capabilities of Thai LLMs, addressing current weaknesses in cultural understanding and commonsense reasoning.

Provider:

Upstage AI

Created:

2024-10-07

Summary of original information

ThaiCLI and Thai-H6 Benchmarks

1. Introduction

Purpose: Advance the development and evaluation of large language models (LLMs) in Thai.
Benchmarks:
- ThaiCLI: Evaluates cultural intelligence.
- Thai-H6: Assesses core language capabilities.

2. ThaiCLI

2-1. Benchmark Overview

Objective: Evaluate LLMs' comprehension of cultural and societal norms in Thailand.
Question Formats:
- Factoid: Conversational questions related to daily life.
- Instruction: Culturally-contextualized tasks requiring specific instructions.
Themes: Royal Family, Religion, Culture, Economy, Humanity, Lifestyle, Politics.
Sample Counts:
- Factoid: 1790 samples.
- Instruction: 100 samples.

2-2. Details about ThaiCLI

Question Formats:
- Factoid Questions: Conversational questions.
- Instruction Tasks: Culturally-contextualized tasks.
Answer Types:
- Chosen Answers: Reflect cultural sensitivity.
- Rejected Answers: Lack cultural awareness.
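The question, chosen answer, and rejected answer described above can be pictured as a small record type. The following is a minimal sketch only: the class name `ThaiCLISample`, its field names, and the lowercase format labels are assumptions made for illustration, not the schema used in the published repository. The theme list is taken from the overview above.

```python
from collections import Counter
from dataclasses import dataclass

# Themes as listed in the benchmark overview.
THEMES = frozenset({
    "Royal Family", "Religion", "Culture", "Economy",
    "Humanity", "Lifestyle", "Politics",
})
# Hypothetical labels for the two question formats.
FORMATS = frozenset({"factoid", "instruction"})


@dataclass(frozen=True)
class ThaiCLISample:
    """One (question, chosen, rejected) triplet; field names are assumed."""
    question: str
    chosen: str      # answer that reflects cultural sensitivity
    rejected: str    # answer that lacks cultural awareness
    theme: str
    fmt: str

    def __post_init__(self):
        # Reject records that cannot be valid triplets.
        if self.theme not in THEMES:
            raise ValueError(f"unknown theme: {self.theme!r}")
        if self.fmt not in FORMATS:
            raise ValueError(f"unknown format: {self.fmt!r}")
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected answers must differ")


def count_by_format(samples):
    """Count samples per question format."""
    return Counter(s.fmt for s in samples)


# Placeholder text only; no real benchmark content is reproduced here.
samples = [
    ThaiCLISample(
        question="<question text>",
        chosen="<culturally appropriate answer>",
        rejected="<culturally unaware answer>",
        theme="Culture",
        fmt="factoid",
    )
]
print(count_by_format(samples))
```

Validating at construction time keeps malformed triplets (for example, identical chosen and rejected answers) out of any later scoring step.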

2-3. Evaluation Results

Models Evaluated: GPT-4o, GPT-4 Turbo, GPT-4o Mini, GPT-3.5 Turbo, Gemini Pro, Claude Sonnet, Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B-Instruct, Qwen2-72B-Instruct, Llama-3-Typhoon-v1.5x-70b-Instruct, Sailor-14B-Chat, SeaLLMs-v3-7B-Chat.
Performance Metrics: ThaiCLI (Avg.), Factoid, Instruction.
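This summary does not say how ThaiCLI (Avg.) is derived from the Factoid and Instruction scores. The sketch below contrasts two plausible readings, a plain mean of the two scores and a mean weighted by the sample counts listed above (1790 and 100). Both function names and the example scores are illustrative assumptions, not reported results.

```python
# Sample counts from the ThaiCLI overview.
FACTOID_N = 1790
INSTRUCTION_N = 100


def macro_average(factoid_score, instruction_score):
    """Plain mean: both formats count equally regardless of size."""
    return (factoid_score + instruction_score) / 2


def sample_weighted_average(factoid_score, instruction_score):
    """Mean weighted by how many samples each format contains."""
    total = FACTOID_N + INSTRUCTION_N
    return (factoid_score * FACTOID_N + instruction_score * INSTRUCTION_N) / total


# Made-up scores, used only to show how the two averages diverge.
factoid, instruction = 6.0, 8.0
print(macro_average(factoid, instruction))
print(sample_weighted_average(factoid, instruction))
```

Because the Factoid set is far larger, the weighted variant stays close to the Factoid score, while the plain mean gives the 100 Instruction samples equal influence. Which definition applies should be checked against the paper before comparing numbers.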

3. Thai-H6

3-1. Benchmark Overview

Objective: Evaluate core capabilities of LLMs in Thai.
Datasets: th-ARC, th-HellaSwag, th-MMLU, th-TruthfulQA, th-GSM8K, th-Winogrande.
Sample Counts:
- th-ARC: 1,222 samples.
- th-HellaSwag: 10,052 samples.
- th-MMLU: 14,585 samples.
- th-TruthfulQA: 817 samples.
- th-GSM8K: 1,324 samples.
- th-Winogrande: 1,272 samples.
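The counts above are uneven, which matters when deciding how to aggregate results. A short sketch that totals them and shows each dataset's share of Thai-H6 (the dictionary name `THAI_H6_SAMPLES` is only an illustrative choice):

```python
# Sample counts as listed in the Thai-H6 overview.
THAI_H6_SAMPLES = {
    "th-ARC": 1222,
    "th-HellaSwag": 10052,
    "th-MMLU": 14585,
    "th-TruthfulQA": 817,
    "th-GSM8K": 1324,
    "th-Winogrande": 1272,
}

total = sum(THAI_H6_SAMPLES.values())
print(f"total samples: {total}")  # total samples: 29272

# Share of the whole suite contributed by each dataset, largest first.
for name, n in sorted(THAI_H6_SAMPLES.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14} {n:>6}  {n / total:6.1%}")
```

th-MMLU and th-HellaSwag together account for most of the samples, so a sample-weighted score would be dominated by those two tasks.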

3-2. Evaluation Results

Models Evaluated: Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B-Instruct, Qwen2-72B-Instruct, Llama-3-Typhoon-v1.5x-70b-Instruct, Sailor-14B-Chat, SeaLLMs-v3-7B-Chat.
Performance Metrics: Thai-H6 (Avg.), th-ARC, th-HellaSwag, th-MMLU, th-TruthfulQA, th-Winogrande, th-GSM8K.
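A reasonable reading of Thai-H6 (Avg.) is an unweighted mean over the six per-dataset scores, as is common for H6-style suites; this summary does not confirm it, so treat that as an assumption. The sketch below computes such a mean and refuses to average when a task score is missing. The function name and the example scores are hypothetical.

```python
from statistics import fmean

# The six datasets named in the Thai-H6 overview.
THAI_H6_TASKS = (
    "th-ARC", "th-HellaSwag", "th-MMLU",
    "th-TruthfulQA", "th-GSM8K", "th-Winogrande",
)


def thai_h6_average(scores):
    """Unweighted mean over all six tasks (assumed aggregation rule)."""
    missing = [t for t in THAI_H6_TASKS if t not in scores]
    if missing:
        # Averaging a partial set would silently inflate or deflate the result.
        raise KeyError(f"missing scores for: {', '.join(missing)}")
    return fmean(scores[t] for t in THAI_H6_TASKS)


# Made-up scores for illustration; not results from the paper.
example_scores = {task: 50.0 for task in THAI_H6_TASKS}
print(thai_h6_average(example_scores))
```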

4. Citation

BibTeX:

@misc{kim2024representingunderrepresentedculturalcore,
  title={Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models},
  author={Dahyun Kim and Sukyung Lee and Yungi Kim and Attapol Rutherford and Chanjun Park},
  year={2024},
  eprint={2410.04795},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.04795},
}

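Collected summary

Dataset introduction

Construction method

The Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI) datasets were built to fill the gap in evaluation frameworks for Thai large language models (LLMs). Thai-H6 translates six internationally recognized benchmarks (such as the AI2 Reasoning Challenge and Massive Multitask Language Understanding) into Thai, with expert validation to ensure linguistic and contextual accuracy. ThaiCLI uses triplets consisting of a question, a chosen answer, and a rejected answer to evaluate a model's understanding of Thai social and cultural norms. Both construction processes emphasize multiple rounds of human review to ensure data quality and an accurate reflection of Thai culture.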

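Features

Thai-H6 is characterized by its breadth: it covers tasks ranging from general reasoning to domain-specific knowledge and mathematical reasoning, allowing an in-depth evaluation of core LLM capabilities. ThaiCLI focuses on a model's understanding of Thai culture and social norms; contrasting chosen and rejected answers highlights how the model performs on cultural sensitivity. Together, the two datasets provide a comprehensive and fine-grained framework for evaluating Thai LLMs.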

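Usage

Researchers can evaluate a model on Thai-H6 and ThaiCLI to measure its core capabilities and cultural understanding in Thai. Thai-H6 is suited to assessing reasoning, knowledge, and commonsense, while ThaiCLI assesses how a model handles Thai cultural and social norms. Combining the results from both datasets gives a fuller picture of a model's suitability and accuracy in Thai-language settings.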
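This page does not describe the scoring protocol for the multiple-choice tasks in Thai-H6 (such as th-ARC or th-MMLU). One common approach for benchmarks of this kind is to score every answer choice with the model and pick the highest-scoring one. The sketch below illustrates that idea under stated assumptions: `Scorer` stands in for whatever model call returns a score (for example, a log-likelihood) for a question and choice pair, and `toy_scorer` is a placeholder rather than a real model.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: (question, choice) -> score, higher is better.
Scorer = Callable[[str, str], float]
Item = Tuple[str, Sequence[str], int]  # (question, choices, gold index)


def predict(question: str, choices: Sequence[str], score: Scorer) -> int:
    """Return the index of the highest-scoring choice."""
    scores = [score(question, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(items: Sequence[Item], score: Scorer) -> float:
    """Fraction of items where the predicted index matches the gold index."""
    if not items:
        raise ValueError("no items to evaluate")
    correct = sum(predict(q, choices, score) == gold for q, choices, gold in items)
    return correct / len(items)


def toy_scorer(question: str, choice: str) -> float:
    """Placeholder scorer: prefers the choice '4'. Replace with a model call."""
    return 1.0 if choice == "4" else 0.0


items = [
    ("2 + 2 = ?", ["3", "4", "5"], 1),
    ("1 + 1 = ?", ["2", "4"], 0),
]
print(accuracy(items, toy_scorer))
```

Keeping the scorer as a plain callable means the same loop works with any model backend. Generative tasks such as th-GSM8K would need a different, answer-extraction style of scoring.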

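Background and challenges

Background overview

With the rapid development of large language models (LLMs), the need to evaluate core capabilities such as reasoning, knowledge, and commonsense has grown, giving rise to widely used evaluation suites such as the H6 benchmarks. These benchmarks, however, are built mainly for English; for languages that are under-represented in LLM development, such as Thai, comparable evaluation resources are scarce. In addition, developing Thai LLMs requires improving not only language understanding but also understanding of Thai culture. To address this dual challenge, we propose two key benchmarks: Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a comprehensive evaluation of various LLMs with multilingual capabilities, we provide an in-depth analysis of these benchmarks and show how they contribute to Thai LLM development.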

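Current challenges

Building Thai-H6 and ThaiCLI involved several challenges. First, existing Thai benchmarks focus mainly on traditional NLP tasks such as word segmentation and named entity recognition, and do not evaluate the broader capabilities of LLMs. Second, Thai LLMs need to be evaluated not only on core language capabilities but also on cultural sensitivity and the complexity of social norms. Third, the Thai language has a complex relationship with neighboring countries, and the cultural biases embedded in the language are difficult to capture fully. Finally, existing evaluation resources often lack depth and cannot adequately assess a model's cultural understanding. These challenges called for multiple rounds of human verification and cultural alignment during dataset construction to ensure comprehensive and accurate evaluation.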

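Common scenarios

Classic use cases

The classic use of Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI) is to evaluate and improve the core capabilities of Thai large language models (LLMs). Thai-H6 uses localized versions of six internationally recognized benchmarks to evaluate reasoning, knowledge, and commonsense in Thai. ThaiCLI focuses on understanding of Thai social and cultural norms, using triplets of a question, a chosen response, and a rejected response to judge whether model output conforms to Thai cultural standards.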

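Practical applications

In practice, Thai-H6 and ThaiCLI are used to develop and optimize Thai LLMs so that their performance across tasks meets the specific requirements of Thai culture and language. In areas such as customer service, education, law, and healthcare, for example, these models need to understand and generate text that fits the Thai cultural context accurately. The datasets also support the development of cross-cultural communication and multilingual systems, contributing to more balanced progress in language technology worldwide.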

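Derived related work

A number of related studies build on Thai-H6 and ThaiCLI. For example, researchers have used these datasets to develop sentiment analysis and named entity recognition models for Thai, further improving performance on Thai NLP tasks. Other work explores how multilingual pretrained models can improve Thai LLM performance and how to design more culturally sensitive evaluation frameworks. These derived works enrich Thai NLP research and offer useful experience for LLM research on other low-resource languages.

The above content was collected and summarized by 遇见数据集.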
