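SIFo Benchmark | Large Language Model Dataset | Instruction-Following Dataset

arXiv | Updated 2024-06-28 | Indexed 2024-07-22

Large language models

Instruction following

Download link: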

https://github.com/shin-ee-chen/SIFo


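Resource overview:

SIFo Benchmark is a benchmark dataset created by the University of Amsterdam and the University of Groningen to evaluate the sequential instruction-following ability of large language models (LLMs). The dataset contains 20 samples, each with 3 to 6 instructions, covering text modification, question answering, mathematics, and security rule following. It is built with a rule-based pipeline that keeps the instructions sequential and coherent. Its main use is to evaluate and improve how well LLMs follow a series of instructions in complex tasks, especially where the instructions must be executed in order to reach the intended result.

Provider:

University of Amsterdam, University of Groningen

Created: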

2024-06-28

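Original information summary

SIFo dataset overview

Overview

The SIFo dataset is designed to evaluate the ability of large language models (LLMs) to follow multiple instructions. Through the sequential instruction following (SIFo) task, it addresses the following challenges:

Limited coherence between multiple instructions.
Position bias, where the order of the instructions affects model performance.
A lack of objectively verifiable tasks.

The SIFo dataset contains four tasks, each evaluating a different aspect of sequential instruction following:

Text modification
Question answering
Mathematics
Security rule following

An evaluation of popular LLMs, both closed-source and open-source, shows that newer and larger models significantly outperform older and smaller ones on the SIFo tasks, which supports the validity of the benchmark. All models struggle to follow sequences of instructions, which points to an important lack of robustness in current language models.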

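AI-collected summary

Dataset introduction

Construction

SIFo Benchmark is built to evaluate how well large language models (LLMs) handle tasks that require following multi-step instructions. It does so through four tasks (text modification, question answering, mathematics, and security rule following), each of which requires executing several instructions in order. The dataset is constructed with a rule-based approach that keeps the instructions within each task coherent and makes the completion of each instruction depend on the result of the previous one. This design is meant to remove the effect of position bias, and it means that the correctness of the final instruction can verify whether the whole instruction sequence was followed.

Features

The main features of SIFo Benchmark are the sequential dependency and objective verifiability of its tasks. The instructions in each task are chained: the current step can only succeed if the previous step produced the right result. This keeps the instructions coherent and avoids position bias. In addition, every task can be verified objectively by checking the correctness of the final instruction, which simplifies evaluation.

Usage

To use SIFo Benchmark, the user feeds the context and the instructions to the model and asks it to execute the instructions one by one in the given order. The model should answer in JSON format so that the answer to each instruction can be extracted easily. Whether the model followed the whole sequence can be verified by checking the correctness of the final instruction. Metrics such as instruction-level accuracy and instruction-following depth allow a finer-grained analysis of model behavior.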
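The sequential dependency described above can be illustrated with a small sketch. It is not the authors' pipeline: the `Step` class, the two rule helpers, and the example sentence are hypothetical and only show how rule-generated instructions can be chained so that a skipped step changes the final result.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    instruction: str             # natural-language instruction shown to the model
    apply: Callable[[str], str]  # rule that computes the reference result


def replace_word(old: str, new: str) -> Step:
    """Hypothetical rule: replace every occurrence of one word with another."""
    return Step(
        instruction=f'Replace every "{old}" with "{new}".',
        apply=lambda text: text.replace(old, new),
    )


def insert_after(word: str, addition: str) -> Step:
    """Hypothetical rule: insert a word after every occurrence of another word."""
    return Step(
        instruction=f'Insert "{addition}" after every "{word}".',
        apply=lambda text: text.replace(word, f"{word} {addition}"),
    )


def reference_states(context: str, steps: List[Step]) -> List[str]:
    """Apply the steps in order; entry i is the expected text after step i + 1."""
    states = []
    current = context
    for step in steps:
        current = step.apply(current)
        states.append(current)
    return states


context = "the cat sat on the mat"
steps = [
    replace_word("cat", "dog"),          # step 1
    insert_after("dog", "quietly"),      # step 2 only changes the text if step 1 was done
    replace_word("quietly", "calmly"),   # step 3 only changes the text if step 2 was done
]

states = reference_states(context, steps)
skipped_first = reference_states(context, steps[1:])  # simulate ignoring step 1
```

Because every step acts on a word introduced by the step before it, skipping the first instruction leaves the text unchanged at the end, so comparing only the final state against the reference is enough to detect the failure.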
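The evaluation flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's official scorer: the JSON key names (`instruction_1`, `instruction_2`, ...), the exact-match comparison, and the definition of following depth as the number of consecutive correct answers from the start are all assumptions made for this example.

```python
import json
from typing import Dict, List, Optional, Union


def parse_answers(raw_output: str, n_instructions: int) -> List[Optional[str]]:
    """Extract one answer per instruction; missing or unparsable answers become None."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return [None] * n_instructions
    if not isinstance(data, dict):
        return [None] * n_instructions
    answers: List[Optional[str]] = []
    for i in range(1, n_instructions + 1):
        value = data.get(f"instruction_{i}")  # assumed key naming, not from the source
        answers.append(str(value).strip() if value is not None else None)
    return answers


def score_sample(
    answers: List[Optional[str]], references: List[str]
) -> Dict[str, Union[bool, float, int]]:
    """Score one sample by exact match against the reference answers."""
    if not references or len(answers) != len(references):
        raise ValueError("answers and references must be non-empty and equally long")
    correct = [a is not None and a == r for a, r in zip(answers, references)]
    depth = 0
    for ok in correct:  # count consecutive correct answers from the first instruction
        if not ok:
            break
        depth += 1
    return {
        "final_correct": correct[-1],
        "instruction_accuracy": sum(correct) / len(correct),
        "following_depth": depth,
    }


# Hypothetical math sample: start from 5, add 3, double the result, subtract 5.
references = ["8", "16", "11"]
raw = '{"instruction_1": "8", "instruction_2": "15", "instruction_3": "10"}'
result = score_sample(parse_answers(raw, len(references)), references)
```

In this made-up sample the model slips at the second instruction, so the error carries into the third and the final-instruction check fails, while the depth metric records that only the first instruction was followed correctly.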

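Background and challenges

Background

As large language models (LLMs) have made marked progress in following instructions, evaluating how they handle multi-step instructions has become especially important. SIFo Benchmark was proposed in 2024 by researchers at the University of Amsterdam and the University of Groningen to evaluate this ability through the sequential instruction following (SIFo) task. The core problems it targets are the coherence between instructions, the effect of position bias on model performance, and the lack of objectively verifiable tasks. SIFo Benchmark evaluates sequential instruction following through four tasks: text modification, question answering, mathematics, and security rule following.

Current challenges

Several challenges arise in building and applying SIFo Benchmark. First, limited coherence between multi-step instructions makes it hard for a model to understand and execute later instructions accurately. Second, position bias lets the order of the instructions have a significant effect on model performance, which complicates evaluation. Third, a lack of objectively verifiable tasks makes evaluation results hard to standardize and compare. During construction, the researchers therefore had to ensure that the instructions are sequentially dependent and that the tasks are objectively verifiable, so that the benchmark is valid and reliable. These challenges affect the accuracy of the evaluation and also raise the bar for future improvements to LLMs.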

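Common scenarios

Typical use cases

SIFo Benchmark is mainly used to evaluate the ability of large language models (LLMs) to follow sequences of multi-step instructions. Its typical use cases are text modification, question answering, mathematical calculation, and security rule following. In each of these tasks the model has to execute a series of instructions in the given order, and the correct execution of each later instruction depends on the result of the previous one. This lets SIFo Benchmark assess how models perform on complex, multi-step tasks.

Academic problems addressed

SIFo Benchmark addresses several key problems in evaluating the multi-step instruction-following ability of current LLMs: insufficient coherence between instructions, the effect of position bias on model performance, and the lack of objectively verifiable tasks. By designing sequential instruction tasks, it keeps the instructions coherent and avoids position bias. Its objectively verifiable task design also provides a fairer and more reproducible evaluation method.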

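Derived and related work

SIFo Benchmark has prompted related research, particularly on multi-step instruction following and complex task handling. For example, some researchers have developed new evaluation methods based on SIFo Benchmark to examine model behavior on different instruction types in more detail, and others have used its data for model training to improve robustness on multi-step tasks. Its use has also encouraged other areas to explore similar sequential task designs.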

The content above was collected and summarized by AI.
