
SALT

ModelScope community · updated 2025-12-05 · indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/SAP/SALT
Resource description:
# SALT: Sales Autocompletion Linked Business Tables Dataset

[![License](https://img.shields.io/badge/license-CC--BY--NC--SA--4.0-blue)]() [![arXiv](https://img.shields.io/badge/arXiv-2501.03413-29d634.svg)](https://arxiv.org/abs/2501.03413)

Dataset for our paper [**SALT: Sales Autocompletion Linked Business Tables Dataset**](https://openreview.net/forum?id=UZbELpkWIr#discussion), presented at the [NeurIPS'24 Table Representation Learning Workshop](https://table-representation-learning.github.io/).

#### News

- **07/10/2025: 🎉🎉🎉 Dataset is now integrated into [RelBench](https://github.com/snap-stanford/relbench) 🎉🎉🎉**
- 01/11/2025: Updated paper (some results changed due to minor dataset changes, screenshots added to appendix)
- 12/19/2024: Train/test splits released
- 12/15/2024: Preliminary dataset now also available on Hugging Face
- 12/13/2024: Provided data
- 10/29/2024: Preliminary repository created

### Abstract

Foundation models, particularly those that incorporate Transformer architectures, have demonstrated exceptional performance in domains such as natural language processing and image processing. Adapting these models to structured data, such as tables, however, introduces significant challenges. These difficulties are even more pronounced for multi-table data linked via foreign keys, which is prevalent in the enterprise realm and crucial for empowering business use cases. Despite its substantial impact, research on such linked business tables within enterprise settings remains an important yet underexplored domain. To address this, we introduce a curated dataset sourced from an Enterprise Resource Planning (ERP) system, featuring extensive linked tables. This dataset is specifically designed to support research in table representation learning. By providing access to authentic enterprise data, we aim to enhance the effectiveness and applicability of models in real-world business contexts.

### Information

![Table Schema of SALT Dataset](https://raw.githubusercontent.com/SAP-samples/salt/master/images/schema.svg "SALT Schema")
*Table Schema of SALT Dataset*

![Screenshot of a Salesorder Input Mask](https://raw.githubusercontent.com/SAP-samples/salt/master/images/SAP_S4HANA_SalesOrder_App.png "Salesorder Input Mask")
*Example Input Mask of a Salesorder App using SAP S/4HANA*

### Usage

Note that the following 8 fields are meant to be used as classification labels (all other fields can be used as input features):

- `SALESOFFICE` (sales office)
- `SALESGROUP` (sales group)
- `CUSTOMERPAYMENTTERMS` (customer payment terms)
- `SHIPPINGCONDITION` (shipping condition)
- `SHIPPINGPOINT` (shipping point)
- `PLANT` (plant)
- `HEADERINCOTERMSCLASSIFICATION` (header Incoterms classification)
- `ITEMINCOTERMSCLASSIFICATION` (item Incoterms classification)

#### Example of loading the tables with Hugging Face datasets

If the `datasets` library is not already installed, install it with:

```bash
pip install datasets
```

```python
from datasets import load_dataset

dataset_name = "sap-ai-research/SALT"
split = "train"  # use "train" or "test"

salesdocuments = load_dataset(dataset_name, "salesdocuments", split=split)
salesdocument_items = load_dataset(dataset_name, "salesdocument_items", split=split)
customers = load_dataset(dataset_name, "customers", split=split)
addresses = load_dataset(dataset_name, "addresses", split=split)

# you can also load the joined table, which combines the four tables in one
joined_table = load_dataset(dataset_name, "joined_table", split=split)

# separate the input features and target fields
target_fields = [
    "SALESOFFICE",
    "SALESGROUP",
    "CUSTOMERPAYMENTTERMS",
    "SHIPPINGCONDITION",
    "SHIPPINGPOINT",
    "PLANT",
    "HEADERINCOTERMSCLASSIFICATION",
    "ITEMINCOTERMSCLASSIFICATION",
]
dataframe = joined_table.to_pandas()
X = dataframe[[c for c in dataframe.columns if c not in target_fields]]
y = dataframe[target_fields]
```

#### Example of loading the tables with pandas

If the `pandas` library is not already installed, install it with:

```bash
pip install pandas
```

```python
import pandas as pd

# load the table data from the parquet files
salesdocuments = pd.read_parquet("I_SalesDocument_train.parquet")
salesdocument_items = pd.read_parquet("I_SalesDocumentItem_train.parquet")
customers = pd.read_parquet("I_Customer.parquet")
addresses = pd.read_parquet("I_AddrOrgNamePostalAddress.parquet")
joined = pd.read_parquet("JoinedTables_train.parquet")

# show the first rows
salesdocuments.head()
```

### Authors

- [Tassilo Klein](https://tjklein.github.io/)
- [Clemens Biehl](https://www.linkedin.com/in/clemens-biehl-43a39a117/)
- [Margarida Costa](https://www.linkedin.com/in/mariamargaridacosta/)
- [André Sreš](https://www.linkedin.com/in/andr%C3%A9-sre%C5%A1-937096160/)
- [Jonas Kolk](https://www.linkedin.com/in/jonas-kolk-b8a94b123/)
- [Johannes Hoffart](https://www.linkedin.com/in/johanneshoffart/)

## Citations

If you use this dataset in your research or want to refer to our work, please cite:

```
@inproceedings{
klein2024salt,
title={{SALT}: Sales Autocompletion Linked Business Tables Dataset},
author={Tassilo Klein and Clemens Biehl and Margarida Costa and Andre Sres and Jonas Kolk and Johannes Hoffart},
booktitle={NeurIPS 2024 Third Table Representation Learning Workshop},
year={2024},
url={https://openreview.net/forum?id=UZbELpkWIr}
}
```

## Roadmap

- [x] Integration into [RelBench](https://relbench.stanford.edu/)
- [x] Release dataset
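
## Example: majority-class baseline for the label fields

The Usage section names eight label fields and treats every other column as an input feature. As a rough illustration of how that split can feed a first sanity check, the sketch below separates features from labels and scores a majority-class baseline per label field. This is an assumption for illustration, not the evaluation protocol of the paper. The toy frames and the `CUSTOMER_ID` column are hypothetical stand-ins; in practice you would pass the train and test versions of the joined table instead.

```python
import pandas as pd

# The eight label fields listed in the Usage section.
TARGET_FIELDS = [
    "SALESOFFICE",
    "SALESGROUP",
    "CUSTOMERPAYMENTTERMS",
    "SHIPPINGCONDITION",
    "SHIPPINGPOINT",
    "PLANT",
    "HEADERINCOTERMSCLASSIFICATION",
    "ITEMINCOTERMSCLASSIFICATION",
]


def split_features_targets(df, target_fields=TARGET_FIELDS):
    """Split a table into input features X and label columns y.

    Only label fields that actually occur in df are used, so the helper
    also works on tables that carry a subset of the labels.
    """
    present = [c for c in target_fields if c in df.columns]
    return df.drop(columns=present), df[present]


def majority_baseline(y_train, y_test):
    """Per-field accuracy of always predicting the most frequent training value."""
    scores = {}
    for field in y_train.columns:
        most_common = y_train[field].mode(dropna=True).iloc[0]
        scores[field] = float((y_test[field] == most_common).mean())
    return scores


# Toy stand-in data; CUSTOMER_ID is a hypothetical feature column.
train = pd.DataFrame({
    "CUSTOMER_ID": ["c1", "c2", "c3", "c4"],
    "PLANT": ["P1", "P1", "P2", "P1"],
    "SHIPPINGPOINT": ["S1", "S2", "S2", "S2"],
})
test = pd.DataFrame({
    "CUSTOMER_ID": ["c5", "c6"],
    "PLANT": ["P1", "P2"],
    "SHIPPINGPOINT": ["S2", "S2"],
})

X_train, y_train = split_features_targets(train)
X_test, y_test = split_features_targets(test)
scores = majority_baseline(y_train, y_test)
print(scores)
```

A baseline like this gives a floor to compare against: a model that uses the input features should beat the majority-class accuracy on each label field before its results are worth reporting.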

Provider:
maas
Created:
2025-11-05