Multi-task Spatial Evaluation Dataset

Name: Multi-task Spatial Evaluation Dataset
Creator: School of Mathematics and Computer Science, Zhejiang A&F University (浙江农林大学数学与计算机科学学院)
Published: 2024-08-27 01:25:16
License: No description available

Source: arXiv. Updated 2024-08-27; indexed 2024-08-28.

Download link:

http://arxiv.org/abs/2408.14438v1

Resource overview:

This study presents a new multi-task spatial evaluation dataset, designed to systematically explore and compare the performance of several state-of-the-art models on spatial tasks. The dataset covers twelve distinct task types, including spatial understanding and path planning, each with verified answers. Its design and construction went through multiple rounds of rigorous review and validation to ensure that the questions are accurate and practical. The dataset is used primarily to assess the spatial task capabilities of Large Language Models (LLMs) in the Geographic Information System (GIS) domain, and targets the difficulties models face with navigation and logical reasoning in complex spatial environments.

Provider:

School of Mathematics and Computer Science, Zhejiang A&F University (浙江农林大学数学与计算机科学学院)

Created:

2024-08-27

Collected summary

Dataset introduction

Construction

The Multi-task Spatial Evaluation Dataset (MSED) was built to assess how well large language models (LLMs) handle spatial tasks. It covers twelve distinct task types, including spatial understanding and path planning, each with verified answers. A team of experts selected the question types along six dimensions: conceptual topics, explanatory subjects, intellectual topics, operational problems, inference problems, and applied problems. The questions were designed to reflect real-world application needs and were drawn from academic papers and online resources such as Wikipedia. The dataset was then rigorously reviewed and validated for accuracy and practicality.

Features

The MSED is comprehensive: it covers a wide range of spatial tasks that test LLMs' abilities to understand and manipulate spatial configurations, plan optimal routes, and recognize geographic entities, among others. Its questions range from rare and complex to straightforward, which supports a thorough assessment of LLM capabilities. The dataset is organized into twelve task categories, each with specific question examples, giving a clear framework for evaluation. The MSED also introduces a new metric, Weighted Accuracy (WA), which assigns different weights to the score of each response to reflect the importance and difficulty of the tasks, giving a more nuanced picture of model performance than plain accuracy.
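The idea behind a weighted accuracy score can be illustrated with a minimal sketch. The listing does not give the actual WA formula, so the grade labels, the weight values in `GRADE_WEIGHTS`, and the normalization by the maximum weight are all assumptions made here for illustration, not the dataset authors' definition.

```python
from typing import Iterable

# Hypothetical weights per grade; the source does not state the real values.
GRADE_WEIGHTS = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


def weighted_accuracy(
    grades: Iterable[str],
    weights: dict[str, float] = GRADE_WEIGHTS,
) -> float:
    """Return earned weight divided by the maximum attainable weight."""
    grades = list(grades)
    if not grades:
        raise ValueError("no graded responses")
    max_weight = max(weights.values())
    earned = sum(weights[g] for g in grades)
    return earned / (max_weight * len(grades))


if __name__ == "__main__":
    # Two correct, one partially correct, one incorrect response.
    print(weighted_accuracy(["correct", "partial", "incorrect", "correct"]))
```

Under these assumed weights, a partially correct answer earns half credit, so a model is not scored the same for a near miss as for a completely wrong answer.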

Usage

The MSED is used in a two-phase testing approach. First, a zero-shot test measures each model's baseline performance without any prompt tuning. Next, the dataset is categorized by difficulty, and prompt tuning tests are run with strategies such as One-shot, Combined Techniques Prompt, Chain of Thought (CoT), and Zero-shot-CoT. For rigor and reproducibility, an automated script standardizes answer collection: it interacts with the models through text input only and in single-turn dialogue mode. Results are evaluated with accuracy statistics and qualitative analysis, and human reviewers play a key role in scoring, especially for partially correct answers. The MSED thus provides a systematic benchmark for testing different LLMs on a variety of spatial tasks and for assessing their capabilities and limitations.
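The text-only, single-turn collection step described above might look roughly like the following sketch. It is not the authors' script: `build_prompt`, `collect_answers`, and the `ask` callable (a stand-in for any model client) are hypothetical names, and the "Let's think step by step." suffix is the commonly used Zero-shot-CoT trigger, assumed here rather than confirmed by the listing. The CoT and Combined Techniques Prompt strategies are omitted because their templates are not described.

```python
from typing import Callable, Optional

# Common Zero-shot-CoT trigger phrase; assumed, not confirmed by the source.
COT_SUFFIX = "Let's think step by step."


def build_prompt(
    question: str,
    strategy: str = "zero-shot",
    example: Optional[tuple[str, str]] = None,
) -> str:
    """Build one text prompt for a single-turn query."""
    if strategy == "zero-shot":
        return question
    if strategy == "zero-shot-cot":
        return f"{question}\n{COT_SUFFIX}"
    if strategy == "one-shot":
        if example is None:
            raise ValueError("one-shot needs a (question, answer) example")
        ex_q, ex_a = example
        return f"Q: {ex_q}\nA: {ex_a}\n\nQ: {question}\nA:"
    raise ValueError(f"unknown strategy: {strategy}")


def collect_answers(
    questions: list[str],
    ask: Callable[[str], str],
    strategy: str = "zero-shot",
    example: Optional[tuple[str, str]] = None,
) -> list[dict[str, str]]:
    """Send each question as an independent single-turn, text-only query."""
    return [
        {"question": q, "answer": ask(build_prompt(q, strategy, example))}
        for q in questions
    ]


if __name__ == "__main__":
    # Placeholder model that just echoes the prompt length.
    fake_model = lambda prompt: f"({len(prompt)} chars received)"
    print(collect_answers(["Is A north of B?"], fake_model, "zero-shot-cot"))
```

Passing the model as a plain callable keeps every question in its own conversation, which matches the single-turn constraint, and lets the same loop run against different model clients.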

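Background and challenges

Background overview

With the emergence of large language models such as ChatGPT and Gemini, evaluating their diverse capabilities, including natural language understanding and code generation, has become especially important. Their performance on spatial tasks, however, has not yet been comprehensively evaluated. This study introduces a new multi-task spatial evaluation dataset to systematically explore and compare several advanced models on spatial tasks. The dataset contains 12 task types, including spatial understanding and path planning, each with verified answers. The study evaluated multiple models, including OpenAI's gpt-3.5-turbo and gpt-4o and ZhipuAI's glm-4, using a two-phase testing approach: a zero-shot test first, followed by prompt tuning tests on the dataset categorized by difficulty. The results show that gpt-4o achieved the highest overall accuracy in the first phase, averaging 71.3%. Although moonshot-v1-8k performed slightly lower overall, it surpassed gpt-4o on place name recognition tasks. The study also highlights the effect of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (CoT) strategy raised gpt-4o's accuracy in path planning from 12.4% to 87.5%, and the one-shot strategy raised moonshot-v1-8k's accuracy on mapping tasks from 10.1% to 76.3%.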

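Current challenges

Challenges associated with this dataset include: 1) the domain problem it addresses, namely spatial tasks such as spatial understanding and path planning; 2) challenges encountered during construction, such as data collection and validation, task design, model evaluation, and prompt strategy optimization.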

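Common scenarios

Typical use cases

The dataset is used mainly to evaluate the performance of large language models on spatial tasks. By comparing multiple models across twelve types of spatial tasks, it gives researchers a systematic benchmark. The tasks include spatial understanding, path planning, and geographic feature search, among others, each with verified answers. The results show that gpt-4o achieved the highest overall accuracy in the first phase, averaging 71.3%. Although moonshot-v1-8k performed slightly lower overall, it surpassed gpt-4o on place name recognition tasks. The study also highlights the effect of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (CoT) strategy raised gpt-4o's accuracy in path planning from 12.4% to 87.5%, and the one-shot strategy raised moonshot-v1-8k's accuracy on mapping tasks from 10.1% to 76.3%.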

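Derived and related work

The introduction of this dataset opens a new direction for research on large language models in spatial tasks. Building on it, researchers can further explore the performance of different models, the effectiveness of prompt strategies, and ways to optimize models for complex spatial tasks. The dataset can also be combined with other benchmarks of large language model performance to evaluate a model's overall capabilities more fully.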

Recent research on the dataset