ds4sd/SemTabNet
Hugging Face | Updated: 2024-06-28 | Indexed: 2024-06-29
Download link:
https://hf-mirror.com/datasets/ds4sd/SemTabNet
Resource overview:
SemTabNet is a dataset for universal information extraction, specifically the Statement Extraction (SE) task on tabular data. For the SE Direct task, the training, test, and validation sets contain 103455, 11682, and 5445 records respectively. The SE task converts a raw input (a table or text) into Statements. The text in the dataset is in English, and the data sources and annotation strategy are described in the accompanying paper, "Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs" by Lokesh Mishra et al., which was accepted at the NLP4Climate workshop at ACL 2024.
Provider:
ds4sd
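Summary of original information
Dataset overview
Basic information
- Name: SemTabNet
- License: MIT
- Task categories:
  - Feature extraction
  - Table question answering
  - Text generation
- Size: 100K<n<1M
- Language: English
- Tags:
  - Information extraction
  - Table understanding
  - Climate
  - ESG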
Dataset description
- Task:
  - Statement Extraction (SE): converts a raw input (a table or text) into Statements; a form of universal information extraction.
Data splits
- Tasks:
  - SE Direct:
    - Training set: 103455
    - Test set: 11682
    - Validation set: 5445
  - SE Indirect 1D:
    - Training set: 72580
    - Test set: 8489
    - Validation set: 3821
  - SE Indirect 2D:
    - Training set: 93153
    - Test set: 22839
    - Validation set: 4903
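To compare the three tasks at a glance, the split sizes listed above can be totalled and turned into percentage shares. The following is a minimal sketch that uses only the numbers from this listing. The names `SPLITS` and `split_shares` are illustrative and not part of the dataset or of any library.

```python
# Split sizes copied from the listing above (train / test / validation per task).
SPLITS = {
    "SE Direct": {"train": 103455, "test": 11682, "validation": 5445},
    "SE Indirect 1D": {"train": 72580, "test": 8489, "validation": 3821},
    "SE Indirect 2D": {"train": 93153, "test": 22839, "validation": 4903},
}


def split_shares(sizes):
    """Return (total, {split: percent}) for one task's split sizes."""
    total = sum(sizes.values())
    shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}
    return total, shares


for task, sizes in SPLITS.items():
    total, shares = split_shares(sizes)
    print(f"{task}: total={total}, shares={shares}")
```

The totals come to 120582 records for SE Direct, 84890 for SE Indirect 1D, and 120895 for SE Indirect 2D.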
Language
- Text language: English



