
SemTabNet

ModelScope community · Updated 2025-12-05 · Indexed 2025-01-25
Download link:
https://modelscope.cn/datasets/ds4sd/SemTabNet
Resource description:
# Dataset Card for SemTabNet

This dataset accompanies the following [paper](https://arxiv.org/abs/2406.19102):

```
Title: Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Authors: Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar
Venue: Accepted at the NLP4Climate workshop at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
```

In this paper, we propose **STATEMENTS** as a new knowledge model for storing quantitative information in a domain-agnostic, uniform structure. The task of converting a raw input (table or text) to Statements is called Statement Extraction (SE). The statement extraction task falls under the category of universal information extraction.

- **Code Repository:** [SemTabNet repository](https://github.com/DS4SD/SemTabNet)
- **Arxiv Paper:** [Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs](https://arxiv.org/abs/2406.19102)
- **Point of Contact:** [IBM Research DeepSearch Team](https://ds4sd.github.io)

### Data Splits

This dataset supports three tasks. The data for each of the three tasks is split into training, validation, and test sets. We also provide the original annotations of the raw tables, from which all other data are constructed.

| Task | Train | Test | Valid |
| ----- | ------ | ----- | ---- |
| SE Direct | 103455 | 11682 | 5445 |
| SE Indirect 1D | 72580 | 8489 | 3821 |
| SE Indirect 2D | 93153 | 22839 | 4903 |

### Languages

The text in the dataset is in English.

### Source and Annotations

The source of this dataset and the annotation strategy are described in the paper.

### Citation Information

Arxiv: [https://arxiv.org/abs/2406.19102](https://arxiv.org/abs/2406.19102)
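As a quick sanity check on the Data Splits table, the per-task totals and split proportions can be tallied with a few lines of Python. This is only an illustrative sketch: the counts are copied from the table, while the names `SPLITS` and `split_summary` are hypothetical and not part of the SemTabNet repository. The card does not say what unit the counts measure, so the code calls them entries.

```python
# Split sizes copied from the Data Splits table of the dataset card.
# SPLITS and split_summary are illustrative names, not part of SemTabNet.
SPLITS = {
    "SE Direct": {"train": 103455, "test": 11682, "valid": 5445},
    "SE Indirect 1D": {"train": 72580, "test": 8489, "valid": 3821},
    "SE Indirect 2D": {"train": 93153, "test": 22839, "valid": 4903},
}


def split_summary(sizes):
    """Return the total number of entries and each split's share of it."""
    total = sum(sizes.values())
    shares = {name: count / total for name, count in sizes.items()}
    return total, shares


if __name__ == "__main__":
    for task, sizes in SPLITS.items():
        total, shares = split_summary(sizes)
        pretty = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
        print(f"{task}: {total} entries ({pretty})")
```

Running the script prints one line per task, which makes it easy to see that the tasks do not share a single fixed train/test/validation ratio.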

Provider:
maas
Created:
2025-01-20