TabFact

Name: TabFact
Creator: University of California, Santa Barbara
Published: 2020-06-15 03:14:22
License: No description available

arXiv; updated 2020-06-15; indexed 2024-06-21

Download link:

https://github.com/wenhuchen/Table-Fact-Checking

Overview:

TabFact is a large-scale dataset for table-based fact verification. Created by the University of California, Santa Barbara and Tencent AI Lab, it uses 16,000 Wikipedia tables as evidence for 118,000 human-annotated natural language statements, each labeled 'ENTAILED' or 'REFUTED'. The core challenge is that verification requires both soft linguistic reasoning and hard symbolic reasoning. Statements of varying difficulty were collected through crowdsourcing, with quality control applied to keep the annotations accurate. TabFact targets research in natural language understanding and semantic representation, with the goal of verifying facts against structured evidence.

Provider:

University of California, Santa Barbara

Created:

2019-09-05

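Collected Summary

Dataset Introduction

Construction

TabFact is built on 16,000 Wikipedia tables, which serve as evidence for verifying 118,000 human-annotated natural language statements, each labeled 'ENTAILED' or 'REFUTED'. The tables were extracted from Wikipedia, and the statements were annotated by crowd workers on Amazon Mechanical Turk. Annotation used 'positive two-channel collection' and a 'negative rewriting strategy' to ensure data quality, and further quality-control measures were applied to screen and filter the annotated data.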

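Features

TabFact combines soft linguistic reasoning with hard symbolic reasoning, which is what makes it challenging. Verifying a statement requires not only understanding the information in the table but also carrying out complex logical reasoning over it. The dataset also uses two annotation channels (simple and complex) to collect statements of different difficulty, which allows a finer-grained assessment of model performance.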

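Usage

TabFact can be used to train and evaluate table-based fact verification models. Researchers can train a model on the tables and statements and measure its performance on the validation and test sets. Two reference models, Table-BERT and the Latent Program Algorithm (LPA), are provided as starting points to improve on. The code and data are available on GitHub for further research and application.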
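As a rough illustration of the evaluation step, the sketch below computes label accuracy for a predictor over (table, statement, label) examples. The in-memory layout, the names `Examples`, `accuracy`, `always_entailed`, and the toy statements are all assumptions made for this example; they do not reproduce the file format or scripts in the GitHub repository.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory layout (an assumption, not the repository's format):
# table id -> list of (statement, label) pairs, where 1 = ENTAILED, 0 = REFUTED.
Examples = Dict[str, List[Tuple[str, int]]]


def accuracy(examples: Examples, predict: Callable[[str, str], int]) -> float:
    """Return the fraction of statements whose predicted label matches the gold label."""
    correct = 0
    total = 0
    for table_id, pairs in examples.items():
        for statement, gold in pairs:
            correct += int(predict(table_id, statement) == gold)
            total += 1
    # Guard against an empty split.
    return correct / total if total else 0.0


def always_entailed(table_id: str, statement: str) -> int:
    """Trivial baseline that labels every statement ENTAILED."""
    return 1


# Toy examples invented for illustration; not taken from TabFact.
toy_examples: Examples = {
    "table_a": [("team x won 3 games", 1), ("team y scored the most points", 0)],
    "table_b": [("there are 2 players from spain", 1)],
}

print(f"accuracy: {accuracy(toy_examples, always_entailed):.3f}")
```

A real model would replace `always_entailed` with a function that reads the table identified by `table_id` and returns a predicted label.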

Background and Challenges

Background Overview

TabFact, introduced in 2020 by researchers from the University of California, Santa Barbara, and Tencent AI Lab, addresses fact verification over semi-structured data, specifically tables. The dataset comprises 118,000 human-annotated statements derived from 16,000 Wikipedia tables, each classified as either ENTAILED or REFUTED. It fills a gap in natural language understanding research, where earlier fact-verification studies focused primarily on unstructured evidence. Structured data matters in real-world applications such as database systems and dialog systems, and TabFact is intended to advance models that reason over both linguistic and symbolic forms.

Current Challenges

TabFact presents several challenges. Firstly, it requires the integration of soft linguistic reasoning with hard symbolic reasoning, necessitating models that can handle both semantic understanding and structured data execution. Secondly, the dataset's construction faced difficulties in ensuring high-quality human annotations, managing annotation artifacts, and maintaining inter-annotator agreement. Thirdly, the models designed to tackle TabFact, such as Table-BERT and Latent Program Algorithm (LPA), must overcome the limitations of existing pre-trained language models in handling structured data and the complexities of program synthesis. These challenges highlight the need for innovative approaches that can effectively bridge linguistic and symbolic reasoning in AI systems.
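To make the "hard symbolic reasoning" side concrete, the following sketch checks two statements by executing small hand-written programs against a table. It is only an illustration of the idea of program execution; it is not the Latent Program Algorithm, and the table, the helper functions (`filter_eq`, `count`, `argmax`, `label`), and the statements are all invented for this example.

```python
from typing import Dict, List

Table = List[Dict[str, str]]  # one dict per row, keyed by column name

# Toy table invented for illustration; not taken from TabFact.
medals: Table = [
    {"nation": "norway", "gold": "3"},
    {"nation": "canada", "gold": "1"},
    {"nation": "japan", "gold": "3"},
]


def filter_eq(table: Table, column: str, value: str) -> Table:
    """Keep the rows whose cell in `column` equals `value`."""
    return [row for row in table if row[column] == value]


def count(table: Table) -> int:
    """Number of rows in the (possibly filtered) table."""
    return len(table)


def argmax(table: Table, column: str) -> Dict[str, str]:
    """Row with the largest numeric value in `column` (first row wins ties)."""
    return max(table, key=lambda row: float(row[column]))


def label(result: bool) -> str:
    """Map a program's boolean result to a verification label."""
    return "ENTAILED" if result else "REFUTED"


# Statement: "two nations won 3 gold medals"
print(label(count(filter_eq(medals, "gold", "3")) == 2))

# Statement: "canada won the most gold medals"
print(label(argmax(medals, "gold")["nation"] == "canada"))
```

The hard part that this sketch skips is the linguistic one: deciding which program corresponds to a given natural language statement.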

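Common Scenarios

Classic Use Cases

The classic use case for TabFact is fact verification over semi-structured tables. With 16k Wikipedia tables and 118k human-annotated natural language statements, it supports models that judge whether a statement is supported or refuted by a table's content. The setting is particularly suited to tasks that combine soft linguistic reasoning with hard symbolic reasoning, such as natural language inference and semantic representation.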

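Academic Problems Addressed

TabFact addresses the research problem of fact verification over structured evidence such as tables, charts, and databases. Prior work focused mainly on unstructured evidence (natural language sentences and documents), while verification against structured evidence remained comparatively underexplored. By providing large-scale semi-structured data, TabFact advances research in this area and is significant for improving natural language understanding and semantic representation.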

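Derived Related Work

TabFact has prompted a line of related research, particularly in table-based fact verification and natural language inference. For example, models such as Table-BERT and the Latent Program Algorithm (LPA), which are provided with the dataset as reference models, use pre-trained language models and symbolic execution to handle tabular data. TabFact has also spurred research on multimodal language reasoning and on program synthesis and semantic parsing.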
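One common way to feed a table to a pre-trained language model is to flatten it into a sentence-like string and pair it with the statement. The sketch below shows that general idea only; the template, the function names `linearize` and `build_input`, and the literal `[SEP]` separator are assumptions for illustration and do not reproduce the exact input format used by Table-BERT.

```python
from typing import Dict, List


def linearize(rows: List[Dict[str, str]], caption: str = "") -> str:
    """Flatten a table into text, one 'row i : col is val ; ...' clause per row."""
    parts = []
    if caption:
        parts.append(f"{caption}.")
    for i, row in enumerate(rows, start=1):
        cells = " ; ".join(f"{col} is {val}" for col, val in row.items())
        parts.append(f"row {i} : {cells} .")
    return " ".join(parts)


def build_input(rows: List[Dict[str, str]], statement: str, caption: str = "") -> str:
    """Pair the flattened table with the statement; '[SEP]' is a placeholder separator."""
    return f"{linearize(rows, caption)} [SEP] {statement}"


# Toy table invented for illustration; not taken from TabFact.
rows = [
    {"player": "ann", "goals": "2"},
    {"player": "bo", "goals": "5"},
]

print(build_input(rows, "bo scored more goals than ann", caption="season scores"))
```

In practice the resulting string would be tokenized by the chosen model's tokenizer, and long tables would need truncation or column selection to fit the model's input limit.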

The content above was collected and summarized by 遇见数据集.
