ds4sd/SemTabNet

Name: ds4sd/SemTabNet
Creator: ds4sd
Published: 2024-06-28 07:46:33
License: MIT

Source: Hugging Face. Updated 2024-06-28; indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/ds4sd/SemTabNet


Summary:

SemTabNet is a dataset for universal information extraction, targeting the Statement Extraction (SE) task on tabular data. For the SE Direct task, the training, test, and validation sets contain 103455, 11682, and 5445 records respectively. The task is to convert raw input (a table or text) into Statements. The text is in English, and the data sources and annotation strategy are described in the accompanying paper, "Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs" by Lokesh Mishra et al., which was accepted at the NLP4Climate workshop at ACL 2024.

Provider:

ds4sd

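Summary of original information

Dataset overview

Basic information

Name: SemTabNet
License: MIT
Task categories:
- Feature extraction
- Table question answering
- Text generation
Size: 100K<n<1M
Language: English
Tags:
- Information extraction
- Table understanding
- Climate
- ESG

Dataset description

Task:
- Statement Extraction (SE): converting raw input (a table or text) into Statements; a form of universal information extraction.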

Data splits

Tasks:
- SE Direct:
  - Training set: 103455
  - Test set: 11682
  - Validation set: 5445
- SE Indirect 1D:
  - Training set: 72580
  - Test set: 8489
  - Validation set: 3821
- SE Indirect 2D:
  - Training set: 93153
  - Test set: 22839
  - Validation set: 4903
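
The split sizes listed above can be totaled per task to see how the records are divided. The following is a minimal sketch: the counts are copied from this listing, while the names `SPLITS` and `split_summary` are illustrative and not part of the dataset or any library.

```python
# Split sizes copied from the listing above; all names here are illustrative.
SPLITS = {
    "SE Direct": {"train": 103455, "test": 11682, "validation": 5445},
    "SE Indirect 1D": {"train": 72580, "test": 8489, "validation": 3821},
    "SE Indirect 2D": {"train": 93153, "test": 22839, "validation": 4903},
}


def split_summary(counts):
    """Return the total record count and each split's share of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


if __name__ == "__main__":
    for task, counts in SPLITS.items():
        total, shares = split_summary(counts)
        parts = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
        print(f"{task}: {total} records ({parts})")
```

With these counts the per-task totals are 120582 (SE Direct), 84890 (SE Indirect 1D), and 120895 (SE Indirect 2D), each of which falls inside the listed size range of 100K<n<1M only for the Direct and Indirect 2D tasks; the range on the listing most likely describes the dataset as a whole.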

Language

Text language: English

Citation information

arXiv: https://arxiv.org/abs/2406.19102
