Multi-Task Spatial Evaluation Dataset

Name: Multi-Task Spatial Evaluation Dataset
Creator: School of Mathematics and Computer Science, Zhejiang A&F University
Published: 2024-09-02 19:59:05
License: Not specified

Source: arXiv. Updated: 2024-09-02. Indexed: 2024-09-05.

Download link:

https://figshare.com/s/be55522f22bf761cfcab

Overview:

This study introduces a comprehensive multi-task spatial evaluation dataset for systematically assessing how large language models perform on spatial tasks. Created by institutions including the School of Mathematics and Computer Science of Zhejiang A&F University, it comprises 900 questions across 12 task types, covering areas such as spatial understanding and path planning. The questions were designed to have both theoretical and practical value. The dataset is mainly used to evaluate how well models apply to Geographic Information Systems (GIS), particularly on tasks such as spatial reasoning and path planning.

Provided by:

School of Mathematics and Computer Science, Zhejiang A&F University

Created:

2024-08-27

Collected Summary

Dataset Introduction

Construction

The Multi-Task Spatial Evaluation Dataset (M-SPED) is a benchmark for assessing the performance of large language models (LLMs) on spatial tasks. It covers twelve task types, including spatial understanding, path planning, and toponymic identification, each with verified and accurate answers. A team of experts built the dataset with a structured data collection strategy, choosing which question types to include along six dimensions: conceptual topics, explanatory subjects, intellectual topics, operational problems, inference problems, and applied problems. The questions were designed to reflect real-world application needs, to give broad coverage, and to be challenging. The dataset went through multiple rounds of review and validation to ensure that the questions are accurate and practical.

Features

The M-SPED dataset is characterized by its comprehensiveness, covering a wide range of spatial tasks, and its structured approach to question design, which aligns with real-world application needs. It includes rare and complex questions distributed across 12 different categories, covering various aspects from GIS concepts to programming skills. The dataset is also notable for its use of verified and accurate answers for each task type, which allows for rigorous evaluation of model performance. Additionally, the dataset is structured to allow for classification by difficulty level, which can be used to evaluate the performance of models on tasks of varying complexity.

Usage

The M-SPED dataset can be used to evaluate the performance of LLMs on spatial tasks. To use the dataset, researchers can select specific task types or difficulty levels to assess the performance of a particular model. The dataset can also be used to compare the performance of different models on the same tasks. In addition to evaluation, the dataset can be used for training and fine-tuning LLMs on spatial tasks. The dataset provides a comprehensive and challenging set of tasks that can be used to improve the spatial reasoning and problem-solving abilities of LLMs.
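As a rough illustration of selecting by task type or difficulty and then scoring a model, the sketch below filters a small list of records and computes accuracy per task type. The record layout (`id`, `task`, `difficulty`, `answer`) and the difficulty labels are assumptions made for this example; the published files may be organized differently.

```python
from collections import defaultdict

# Hypothetical records: these field names and values are assumptions,
# not the dataset's actual schema.
QUESTIONS = [
    {"id": 1, "task": "spatial understanding", "difficulty": "easy", "answer": "B"},
    {"id": 2, "task": "path planning", "difficulty": "hard", "answer": "A"},
    {"id": 3, "task": "path planning", "difficulty": "easy", "answer": "C"},
    {"id": 4, "task": "toponymic identification", "difficulty": "medium", "answer": "D"},
]


def select(questions, task=None, difficulty=None):
    """Keep questions matching the given task type and/or difficulty level."""
    return [
        q for q in questions
        if (task is None or q["task"] == task)
        and (difficulty is None or q["difficulty"] == difficulty)
    ]


def accuracy_by_task(questions, predictions):
    """Return {task: fraction correct}; predictions maps question id -> answer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["task"]] += 1
        if predictions.get(q["id"]) == q["answer"]:
            correct[q["task"]] += 1
    return {task: correct[task] / total[task] for task in total}
```

Comparing two models on the same tasks then amounts to calling `accuracy_by_task` once per model with the same question subset and different `predictions`.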

Background and Challenges

Background Overview

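With the emergence of large language models such as ChatGPT and Gemini, evaluating their diverse capabilities, including natural language understanding and code generation, has become increasingly important. Their performance on spatial tasks, however, has not been comprehensively assessed. This study aims to fill that gap by introducing a novel multi-task spatial evaluation dataset for systematically exploring and comparing the performance of several advanced models on spatial tasks. The dataset covers twelve task types, including spatial understanding and path planning, each with verified and accurate answers. Multiple models, including OpenAI's gpt-3.5-turbo and gpt-4o and ZhipuAI's glm-4, were evaluated with a two-stage testing approach. The results show that gpt-4o achieved the highest overall accuracy in the first stage, averaging 71.3%. Although moonshot-v1-8k performed slightly below gpt-4o overall, it surpassed gpt-4o on toponymic identification. The study also highlights how prompting strategies affect model performance on specific tasks. For example, the Chain-of-Thought (CoT) strategy raised gpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.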

Current Challenges

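The domain problem this dataset addresses is the evaluation of large language model performance on spatial tasks. Challenges encountered during construction include: 1) designing and building a comprehensive spatial task dataset that covers everything from GIS concepts to programming skills; 2) evaluating the performance of multiple models, including OpenAI's gpt-3.5-turbo and gpt-4o and ZhipuAI's glm-4; 3) designing a complete set of test scripts that use API calls and precisely controlled parameter settings to keep the experiments rigorous and reproducible; 4) running two rounds of testing, a zero-shot round and a difficulty-based prompt-tuning round; 5) developing a comprehensive evaluation strategy and introducing a new metric, WA, to make model performance easier to observe.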
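The two-round procedure with fixed parameters can be pictured with a small harness like the one below. This is only a sketch under stated assumptions: `fake_model` stands in for a real API call, the `PARAMS` values are illustrative (the source says only that parameters were precisely controlled), the record fields and the rule of re-testing "hard" questions in the second round are hypothetical, and the WA metric is not reproduced because it is not defined here.

```python
# Assumed decoding settings, fixed across all calls for reproducibility.
PARAMS = {"temperature": 0.0, "max_tokens": 256}

# Hypothetical records; the field names are assumptions.
QUESTIONS = [
    {"id": 1, "text": "Which option lies north of X?", "difficulty": "easy", "answer": "A"},
    {"id": 2, "text": "Which route avoids the obstacle?", "difficulty": "hard", "answer": "C"},
]


def fake_model(prompt, **params):
    """Stand-in for an API call; gets route questions right only with step-by-step prompts."""
    if "route" in prompt:
        return "C" if "step by step" in prompt else "A"
    return "A"


def zero_shot(question):
    """Round 1 prompt: the bare question text."""
    return question["text"]


def adjusted(question):
    """Round 2 prompt: a hypothetical adjusted prompt for harder questions."""
    return question["text"] + "\nThink step by step before answering."


def run_round(model, questions, build_prompt, params=PARAMS):
    """Query the model once per question; return {question id: answered correctly}."""
    results = {}
    for q in questions:
        reply = model(build_prompt(q), **params)
        results[q["id"]] = reply.strip() == q["answer"]
    return results


def run_two_rounds(model, questions):
    """Round 1: zero-shot on everything. Round 2: adjusted prompt on hard questions."""
    round1 = run_round(model, questions, zero_shot)
    hard = [q for q in questions if q["difficulty"] == "hard"]
    round2 = run_round(model, hard, adjusted)
    return round1, round2
```

Swapping `fake_model` for a function that wraps a real client keeps the rest of the harness unchanged, since the model is passed in as a plain callable.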

Common Scenarios

Typical Use Cases

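The dataset is mainly used to evaluate large language models on spatial tasks across twelve task types, including spatial understanding, path planning, and toponymic identification. Researchers evaluated multiple models, including OpenAI's gpt-3.5-turbo and gpt-4o and ZhipuAI's glm-4, through zero-shot tests and prompt-tuning tests. The results show that gpt-4o achieved the highest overall accuracy in the first stage, averaging 71.3%. The study also highlights how prompting strategies affect model performance on specific tasks: the Chain-of-Thought (CoT) strategy raised gpt-4o's accuracy in path planning from 12.4% to 87.5%, and a one-shot strategy raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.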
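To make the three prompting styles mentioned here concrete, the sketch below builds a zero-shot, a Chain-of-Thought, and a one-shot prompt for the same question. The templates and the `Answer:` convention are assumptions for illustration, not the wording used in the study.

```python
def zero_shot_prompt(question):
    """Ask the question directly, with no examples or reasoning cue."""
    return f"Question: {question}\nAnswer:"


def cot_prompt(question):
    """Chain-of-Thought: ask for intermediate reasoning before the final answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )


def one_shot_prompt(question, example_question, example_answer):
    """One-shot: show a single solved example before the real question."""
    return (
        f"Question: {example_question}\nAnswer: {example_answer}\n\n"
        f"Question: {question}\nAnswer:"
    )


def extract_answer(reply):
    """Return the text after the last 'Answer:' marker, or the whole reply if absent."""
    marker = "Answer:"
    index = reply.rfind(marker)
    if index == -1:
        return reply.strip()
    return reply[index + len(marker):].strip()
```

Because a CoT reply contains reasoning text before the answer, a small parser such as `extract_answer` is needed before comparing the reply with the verified answer.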

Academic Problems Addressed

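The dataset addresses the problem of evaluating large language models on spatial tasks. Although these models perform well in areas such as natural language understanding and code generation, their performance on spatial tasks has not been comprehensively assessed. By covering twelve task types, including spatial understanding and path planning, the dataset supports a systematic exploration and comparison of several advanced models on spatial tasks and provides a comprehensive benchmark for assessing the spatial reasoning abilities of large language models.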

Related Derived Work

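Several related works are associated with this dataset. Examples include evaluating how accurately models such as ChatGPT and gpt-4 extract geographic locations from disaster-related social media messages, and evaluating the spatio-temporal reasoning of large language models (for example, gpt-4 optimized with various few-shot prompting methods, and fine-tuned BART and T5 models) on "route planning" tasks, which require a model to navigate to a target location while avoiding preset obstacles. The dataset has also been used to evaluate the spatial reasoning of large language models (for example, ChatGPT-3.5, ChatGPT-4, and Llama 2-7B) on 3D robot trajectory data and related tasks, including 2D direction and shape labeling.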

The content above was collected and summarized by 遇见数据集.
