SemTabNet
Hosted on the ModelScope community. Updated 2025-12-05; indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/ds4sd/SemTabNet
# Dataset Card for SemTabNet
This dataset accompanies the following [paper](https://arxiv.org/abs/2406.19102):
```
Title: Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Authors: Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar
Venue: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
```
In this paper, we propose **STATEMENTS** as a new knowledge model for storing quantitative information in a domain-agnostic, uniform structure. The task of converting a raw input (table or text) to Statements is called Statement Extraction (SE). It falls under the category of universal information extraction.
- **Code Repository:** [SemTabNet repository](https://github.com/DS4SD/SemTabNet)
- **Arxiv Paper:** [Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs](https://arxiv.org/abs/2406.19102)
- **Point of Contact:** [IBM Research DeepSearch Team](https://ds4sd.github.io)
### Data Splits
This dataset supports three tasks. The data for each task is split into training, validation, and test sets. We also provide the original annotations of the raw tables, from which all other data is constructed.
| Task | Train | Test | Valid |
| ----- | ------ | ----- | ---- |
| SE Direct | 103455 | 11682 | 5445 |
| SE Indirect 1D | 72580 | 8489 | 3821 |
| SE Indirect 2D | 93153 | 22839 | 4903 |
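
As a quick sanity check on the split sizes, the short sketch below computes the total number of samples per task and the percentage held in each split. The counts are copied from the table above; the `SPLITS` dictionary and the `split_shares` helper are illustrative names and are not part of the dataset or its repository.

```python
# Split sizes copied from the table above, as (train, test, valid) per task.
SPLITS = {
    "SE Direct": (103455, 11682, 5445),
    "SE Indirect 1D": (72580, 8489, 3821),
    "SE Indirect 2D": (93153, 22839, 4903),
}


def split_shares(train, test, valid):
    """Return the total sample count and each split's share in percent."""
    total = train + test + valid
    shares = tuple(round(100 * n / total, 1) for n in (train, test, valid))
    return total, shares


if __name__ == "__main__":
    for task, sizes in SPLITS.items():
        total, (train_pct, test_pct, valid_pct) = split_shares(*sizes)
        print(
            f"{task}: total={total}, "
            f"train={train_pct}%, test={test_pct}%, valid={valid_pct}%"
        )
```

Running the script prints one line per task, which makes it easy to see how the train, test, and validation proportions differ between the three tasks.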
### Languages
The text in the dataset is in English.
### Source and Annotations
The source of this dataset and the annotation strategy are described in the paper.
### Citation Information
Arxiv: [https://arxiv.org/abs/2406.19102](https://arxiv.org/abs/2406.19102)
Provider: maas
Created: 2025-01-20



