Vietnamese-Function-Calling-Test

Hugging Face · Updated 2024-11-28 · Indexed 2024-12-12

Download link:

https://huggingface.co/datasets/phamhai/Vietnamese-Function-Calling-Test

Overview:

The Vietnamese Function Calling Benchmark dataset contains 2,899 single-turn function calling samples, covering multiple domains such as banking, insurance, travel, education, healthcare, recruitment, vehicle control, shopping, work, and automotive services. The dataset includes 159 functions, aiming to provide a comprehensive benchmark for RAG applications in Vietnamese chat systems.

Created:

2024-11-19

Summary of original information

Vietnamese Function Calling Benchmark

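Dataset details

Dataset size: 2,899 single-turn function calling samples
Domains:
- Banking
- Insurance
- Travel
- Education
- Healthcare
- Recruitment
- Vehicle control
- Shopping
- Work
- Automotive services
Number of functions: 159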

Model evaluation

Model name	Model size	Function name accuracy (%)	Exact match accuracy (%)
phamhai/Llama-3.2-3B-Instruct-Frog	~3B	95.79	51.05
Gemini-1.5-Pro	---	96.96	55.16
Gemini-1.5-Flash	---	97.10	51.64
Gemini-1.5-Flash-8B	---	97.38	64.75
gpt-4o-2024-08-06	---	94.38	52.88
arcee-ai/Arcee-VyLinh	~3B	---	---
phamhai/Llama-3.2-3B-Instruct-Frog-Pro	~3B	98.12	56.38

Evaluation code

Load the model and dataset

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_path = "phamhai/Llama-3.2-3B-Instruct-Frog"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    force_download=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

dataset = load_dataset("phamhai/Vietnamese-Function-Calling-Test")

Frog model inference code

from tqdm import tqdm

def infer(text, tools):
    # The Vietnamese system prompt tells the model it has access to the listed
    # functions and must pick one of them to answer the user's question.
    messages = [
        {"role": "system", "content": "Bạn là một trợ lý hữu ích với khả năng truy cập vào các hàm sau. Hãy chọn một trong các công cụ được cung cấp dưới đây để sử dụng cho việc trả lời câu hỏi của người dùng - %s" % ", ".join(tools)},
        {"role": "user", "content": text},
    ]
    tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    tokenized_chat = tokenized_chat.to("cuda:0")

    outputs = model.generate(tokenized_chat, max_new_tokens=128)
    # Keep only the text after the function call marker and drop the end-of-turn token.
    return tokenizer.decode(outputs[0]).split("<functioncall> ")[-1].replace("<|eot_id|>", "")

preds = []
golds = []

for d in tqdm(dataset["test"]):
    golds.append(d["output"])
    preds.append(infer(d["input_text"], d["tools"]))
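The return statement of infer does all of the output post-processing: it keeps the text after the last `<functioncall> ` marker and removes the `<|eot_id|>` token. A minimal sketch of that step as a standalone helper follows. The helper name `extract_function_call` and the sample string are hypothetical and are not part of the dataset card; the extra whitespace trim is also an addition.

```python
def extract_function_call(decoded: str) -> str:
    """Return the text that follows the last '<functioncall> ' marker.

    Mirrors the string handling in infer(), plus a whitespace trim.
    If the marker is absent, the whole input is returned unchanged
    (apart from the token removal and trim).
    """
    tail = decoded.split("<functioncall> ")[-1]
    return tail.replace("<|eot_id|>", "").strip()


# Hypothetical decoded model output, for illustration only.
sample = (
    'assistant <functioncall> '
    '{"name": "get_balance", "arguments": {"account": "123"}}<|eot_id|>'
)
print(extract_function_call(sample))
```

Because a missing marker returns the raw text, a generation that never emits a function call is passed through as-is and will simply fail to parse at scoring time.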

Gemini-1.5-Pro inference code

import google.generativeai as genai
from google.generativeai.types import content_types
from collections.abc import Iterable
from tqdm import tqdm

def tool_config_from_mode(mode: str, fns: Iterable[str] = ()):
    # Build a tool config that only sets the function calling mode.
    return content_types.to_tool_config(
        {"function_calling_config": {"mode": mode}}
    )

tool_config = tool_config_from_mode("any")

genai.configure(api_key="")

def infer_gemini_with_tools(text, tools):
    model = genai.GenerativeModel("gemini-1.5-pro")

    prepare_tools_for_gem = []
    for tool in tools:
        # Each tool is stored as a string; eval should only be used on trusted data.
        tool = eval(tool)
        # Drop the parameters entry when the function takes no arguments.
        if len(tool["parameters"]["properties"]) == 0:
            tool.pop("parameters", None)
        prepare_tools_for_gem.append(tool)

    i = 0
    response = None
    while True:
        try:
            i += 1
            response = model.generate_content(
                text,
                tools=[{"function_declarations": prepare_tools_for_gem}],
                generation_config=genai.GenerationConfig(
                    max_output_tokens=1000,
                    temperature=0.1,
                ),
                tool_config=tool_config
            )
            part = response.candidates[0].content.parts[0]
            if "function_call" in part:
                return {
                    "name": part.function_call.name,
                    "arguments": dict(part.function_call.args)
                }
            else:
                return {
                    "name": part.text,
                    "arguments": ""
                }
        except Exception as e:
            print(e)
        # Give up after more than 10 attempts; response is None if every call failed.
        if i > 10:
            return response

preds = []
golds = []

for d in tqdm(dataset["test"]):
    golds.append(d["output"])
    preds.append(infer_gemini_with_tools(d["input_text"], d["tools"]))
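The API-based scripts parse each tool string and drop its `parameters` entry when the function takes no arguments. A minimal sketch of that preparation step follows, assuming each tool is a stringified Python dict with `name`, `description`, and `parameters` keys. That layout, the helper name `prepare_tool`, and the sample tools are assumptions for illustration. The sketch uses `ast.literal_eval` in place of the card's `eval`, since it only accepts literals.

```python
import ast


def prepare_tool(tool_str: str) -> dict:
    """Parse one stringified tool and drop an empty 'parameters' entry."""
    tool = ast.literal_eval(tool_str)
    if len(tool["parameters"]["properties"]) == 0:
        tool.pop("parameters", None)
    return tool


# Hypothetical tool definitions, not taken from the dataset.
no_args = (
    "{'name': 'lock_doors', 'description': 'Lock all car doors', "
    "'parameters': {'type': 'object', 'properties': {}}}"
)
with_args = (
    "{'name': 'set_temperature', 'description': 'Set cabin temperature', "
    "'parameters': {'type': 'object', "
    "'properties': {'celsius': {'type': 'number'}}}}"
)

print(prepare_tool(no_args))
print(prepare_tool(with_args))
```

Note that `ast.literal_eval` rejects JSON-only literals such as `true` or `null`; if the tool strings turn out to be JSON, `json.loads` would be the matching parser.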

OpenAI's GPT-4o inference code

from openai import OpenAI
from tqdm import tqdm
import json

client = OpenAI(api_key="")

def infer_gpt4o_with_tools(text, tools):
    prepare_tools_for_gpt = []
    for tool in tools:
        # Each tool is stored as a string; eval should only be used on trusted data.
        tool = eval(tool)
        # Drop the parameters entry when the function takes no arguments.
        if len(tool["parameters"]["properties"]) == 0:
            tool.pop("parameters", None)
        prepare_tools_for_gpt.append({
            "type": "function",
            "function": tool,
        })

    messages = [{"role": "user", "content": text}]

    c = 0
    while True:
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=prepare_tools_for_gpt,
                temperature=0,
                tool_choice="required",
            )
            response_message = response.choices[0].message
            if response_message.tool_calls:
                tool_call = response_message.tool_calls[0]
                return {
                    "name": tool_call.function.name,
                    "arguments": dict(json.loads(tool_call.function.arguments))
                }
            else:
                return {
                    "name": "not using tool",
                    "arguments": ""
                }
        except Exception as e:
            print(e)
            c += 1
        # Give up after more than 3 failed attempts.
        if c > 3:
            return {
                "name": "not using tool",
                "arguments": ""
            }

preds = []
golds = []

for d in tqdm(dataset["test"]):
    golds.append(d["output"])
    preds.append(infer_gpt4o_with_tools(d["input_text"], d["tools"]))
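The accuracy code reads `./test_results.json` and unpacks it into `preds` and `golds`, so the file has to hold a two-element list. The saving step itself is not shown in the source. A minimal sketch of one way to write it follows, using toy lists and a temporary directory; the sample records are hypothetical.

```python
import json
import os
import tempfile

# Toy stand-ins for the lists built by the inference loops.
preds = ['{"name": "get_balance", "arguments": {"account": "123"}}']
golds = ['<functioncall> {"name": "get_balance", "arguments": {"account": "123"}}']

# A temporary directory is used here; the accuracy code expects the file
# at ./test_results.json.
path = os.path.join(tempfile.mkdtemp(), "test_results.json")

with open(path, "w", encoding="utf-8") as f_w:
    # ensure_ascii=False keeps Vietnamese text readable in the saved file.
    json.dump([preds, golds], f_w, ensure_ascii=False)

# Reading it back yields the same two lists, in the same order.
with open(path, "r", encoding="utf-8") as f_r:
    loaded_preds, loaded_golds = json.load(f_r)
```

Predictions from the Frog model are strings, while the Gemini and GPT-4o functions return dicts; both serialize to JSON without extra handling.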

Accuracy computation code

import json

with open("./test_results.json", "r") as f_r:
    preds, golds = json.load(f_r)

correct_fc_name = 0
correct_full_fc = 0

for i in range(len(preds)):
    try:
        # Close a string prediction that was cut off before its final brace.
        if type(preds[i]) == str and not preds[i].endswith("}}"):
            preds[i] = preds[i] + "}"
        # String predictions are parsed; dict predictions (API models) are used as-is.
        p = eval(preds[i]) if isinstance(preds[i], str) else preds[i]
        g = eval(golds[i].replace("<functioncall> ", ""))
        if p["name"] == g["name"]:
            correct_fc_name += 1
        if p == g:
            correct_full_fc += 1
    except Exception:
        # Predictions that cannot be parsed count as incorrect.
        pass

print("Accuracy in classifying into the correct function name: ", correct_fc_name / len(preds))
print("Accuracy in classifying into the correct function and all associated parameters: ", correct_full_fc / len(preds))
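The two reported metrics differ only in what must match: function name accuracy compares the `name` field, while exact match accuracy compares the whole call, arguments included. A minimal sketch on toy data follows, assuming predictions and golds are JSON-formatted (so `json.loads` replaces `eval`). The function `score` and all sample records are hypothetical.

```python
import json


def score(preds, golds):
    """Return (function name accuracy, exact match accuracy)."""
    name_hits = 0
    full_hits = 0
    for pred, gold in zip(preds, golds):
        try:
            p = json.loads(pred) if isinstance(pred, str) else pred
            g = json.loads(gold.replace("<functioncall> ", ""))
        except (json.JSONDecodeError, TypeError):
            continue  # unparseable entries count as wrong
        if not isinstance(p, dict) or not isinstance(g, dict):
            continue
        if p.get("name") == g.get("name"):
            name_hits += 1
        if p == g:
            full_hits += 1
    n = len(preds)
    return name_hits / n, full_hits / n


golds = [
    '<functioncall> {"name": "get_balance", "arguments": {"account": "123"}}',
    '<functioncall> {"name": "book_flight", "arguments": {"to": "Hanoi"}}',
    '<functioncall> {"name": "lock_doors", "arguments": {}}',
    '<functioncall> {"name": "find_jobs", "arguments": {"role": "nurse"}}',
]
preds = [
    '{"name": "get_balance", "arguments": {"account": "123"}}',  # exact match
    '{"name": "book_flight", "arguments": {"to": "Hue"}}',       # name only
    '{"name": "unlock_doors", "arguments": {}}',                 # wrong name
    'not a function call',                                       # unparseable
]

name_acc, full_acc = score(preds, golds)
print(name_acc, full_acc)  # 0.5 0.25
```

The second sample shows why the two columns in the evaluation table can diverge sharply: a model can pick the right function and still miss on an argument value.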

Contact the author

Email: phamhuuhai1402@gmail.com

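Collected summary

Dataset introduction

Construction

The Vietnamese-Function-Calling-Test dataset was built to provide a comprehensive, standardized benchmark for Vietnamese function calling. It contains 2,899 single-turn function calling samples covering banking, insurance, travel, education, healthcare, recruitment, vehicle control, shopping, work, and automotive services. The samples were collected and annotated with care so that each one reflects the function calling needs of a realistic application scenario.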

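Features

The dataset stands out for its diversity and its breadth of application scenarios. It spans multiple industries and includes 159 distinct functions, which allows a model to be evaluated across varied contexts. Its design emphasizes practical use, so that the samples reflect the complexity of function calling in Vietnamese. The dataset also comes with evaluation results for several models, which helps users compare their function calling performance.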

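Usage

To use the Vietnamese-Function-Calling-Test dataset, load it from the Hugging Face platform and run the provided code for model inference and evaluation. The provided scripts cover several mainstream models, including Llama, Gemini, and GPT-4o. Users can test the model of their choice and compute its function calling accuracy with the provided evaluation code. The workflow is straightforward and suits researchers and developers doing model selection and optimization in real projects.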

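Background and challenges

Background overview

As Vietnamese chatbot systems have become widely deployed, applications based on retrieval-augmented generation (RAG) have become an active research topic. Although many large language models (LLMs) already support function calling (FC) in Vietnamese, the field has lacked a unified and comprehensive benchmark dataset. The Vietnamese-Function-Calling-Test dataset was created to fill that gap and to give product teams a standardized evaluation tool for choosing a model. Recently released by phamhai and colleagues, it contains 2,899 single-turn function calling samples across banking, insurance, travel, education, healthcare, and other domains, involving 159 different functions. Its release both fills a gap in Vietnamese function calling and provides data for related research.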

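Current challenges

The dataset faces several challenges in construction and application. First, Vietnamese is a low-resource language whose grammar and vocabulary differ markedly from those of mainstream languages such as English, which makes accurate function identification and parameter extraction harder. Second, the dataset must include function calling samples from many domains to be broad and representative, which places high demands on data collection and annotation. Third, function calling is itself a complex task: a model must identify the user's intent and also select and call the right function, which tests its reasoning and generalization. Finally, evaluation standards are not yet fully unified, and how to design metrics that measure model performance comprehensively remains an open question.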

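Common scenarios

Typical use cases

Vietnamese-Function-Calling-Test holds an important place in Vietnamese natural language processing, especially in the development of Vietnamese chatbot systems, where it is used to evaluate and optimize function calling (FC). With 2,899 single-turn function calling samples across banking, insurance, travel, education, healthcare, and other domains, it gives researchers and developers a standardized benchmark for testing and comparing large language models (LLMs) on Vietnamese function calling.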

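Practical applications

In practice, the dataset is used in the development of Vietnamese chatbots and intelligent assistants. It lets product teams choose the model best suited to their application scenario, improving system responsiveness and accuracy. It has also been used to optimize Vietnamese speech recognition and natural language understanding systems, improving user experience and system performance.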

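Derived and related work

The dataset has given rise to a line of work on Vietnamese function calling based on models such as Llama, Gemini, and GPT-4o. This work both validates the dataset and advances Vietnamese natural language processing. For example, the phamhai/Llama-3.2-3B-Instruct-Frog model performs well on the dataset and offers a technical reference and optimization direction for follow-up research.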

The content above was collected and summarized by 遇见数据集.