struct-text
ModelScope community · updated 2025-12-05 · indexed 2025-12-06
Download link: https://modelscope.cn/datasets/ibm-research/struct-text
# StructText — SEC_WikiDB & SEC_WikiDB_subset
*Dataset card for the VLDB 2025 TaDA-workshop submission “StructText: A
Synthetic Table-to-Text Approach for Benchmark Generation with
Multi-Dimensional Evaluation” (under review).*
```python
from datasets import load_dataset

# default = SEC_WikiDB_unfiltered_all
ds = load_dataset(
    "ibm-research/struct-text",
    trust_remote_code=True)

# a specific configuration
subset = load_dataset(
    "ibm-research/struct-text",
    "SEC_WikiDB_subset_unfiltered_planned",
    trust_remote_code=True)
```
---
## 1 Dataset at a glance
| Family | Size (CSV files) | Split sizes (train/dev/test) | Notes |
| ----------------------- | ---------------- | ---------------------------- | ----------------------------------------------------- |
| **SEC\_WikiDB** | ≈ 1 000 | 80 % / 10 % / 10 % | Parsed from EDGAR 10-K / 10-Q filings + WikiDB tables |
| **SEC\_WikiDB\_subset** | 49 | 39 / 5 / 5 | Handy subset used in the paper |

Each split contains three *file types*:

| File pattern | Meaning |
| --------------------------- | ------------------------------------------------------- |
| `*_original.csv` | Raw structured data (columns + rows) |
| `*_generated_reports_*.csv` | Text generated from the table via Qwen-2-5-72B-Instruct |
| `*_report_types_*.csv` | Reference text produced by our planning module |
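
As a rough illustration of how the three file types could be told apart by name, the sketch below matches file names against glob patterns based on the table above. The labels (`original`, `generated`, `planned`) and the example file names are assumptions made for this sketch; the folder layout in Section 2 lists shorter names (`*_generated.csv`, `*_planned.csv`), so check the actual files before relying on these patterns.

```python
from fnmatch import fnmatch
from typing import Optional

# Patterns based on the file-type table; the labels on the right are
# illustrative names chosen for this sketch, not part of the dataset.
FILE_PATTERNS = [
    ("*_generated_reports_*.csv", "generated"),
    ("*_report_types_*.csv", "planned"),
    ("*_original.csv", "original"),
]


def classify_file(name: str) -> Optional[str]:
    """Return the file type for a CSV file name, or None if nothing matches."""
    for pattern, label in FILE_PATTERNS:
        if fnmatch(name, pattern):
            return label
    return None


# Hypothetical file names, used only to exercise the function.
examples = [
    "table_001_original.csv",
    "table_001_generated_reports_v1.csv",
    "table_001_report_types_v1.csv",
    "README.md",
]
labels = {name: classify_file(name) for name in examples}
print(labels)
```

Files that match none of the patterns map to `None`, which makes it easy to spot unexpected names in a directory listing.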
---
## 2 Folder layout
```
SEC_WikiDB/
├─ unfiltered/
│  ├─ train/   *_original.csv │ *_generated.csv │ *_planned.csv
│  ├─ dev/     …
│  └─ test/    …
└─ filtered/   # <- coming soon
SEC_WikiDB_subset/
├─ unfiltered/
│  ├─ train/   *_original.csv │ *_generated.csv │ *_planned.csv
│  ├─ dev/     …
│  └─ test/    …
└─ filtered/   # <- coming soon
```
The **loader** treats
`<family>_<filtered|unfiltered>_<all|original|generated|planned>`
as *configuration names*, e.g. `SEC_WikiDB_filtered_generated`.
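
The naming pattern above can be expanded mechanically. The following sketch builds every name the pattern allows; the helper `config_name` is hypothetical and only assembles strings, it does not query the loader. Because the folder layout marks `filtered/` as coming soon, the `filtered` names may not resolve yet.

```python
from itertools import product

# Values taken from the pattern <family>_<filtered|unfiltered>_<all|original|generated|planned>.
FAMILIES = ("SEC_WikiDB", "SEC_WikiDB_subset")
FILTERS = ("filtered", "unfiltered")
PARTS = ("all", "original", "generated", "planned")


def config_name(family: str, filtering: str, part: str) -> str:
    """Assemble a configuration name, rejecting values outside the pattern."""
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family!r}")
    if filtering not in FILTERS:
        raise ValueError(f"unknown filter setting: {filtering!r}")
    if part not in PARTS:
        raise ValueError(f"unknown part: {part!r}")
    return f"{family}_{filtering}_{part}"


# Every combination the pattern permits (2 families x 2 filters x 4 parts).
ALL_CONFIGS = [config_name(*combo) for combo in product(FAMILIES, FILTERS, PARTS)]

for name in ALL_CONFIGS:
    print(name)
```

A name produced this way can be passed as the second argument to `load_dataset`, as in the examples elsewhere in this card.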
---
## 3 Quick-start examples
```python
from datasets import load_dataset
import pandas as pd

# full corpus, but original tables only
orig = load_dataset("ibm-research/struct-text",
                    "SEC_WikiDB_unfiltered_original",
                    trust_remote_code=True)

# rebuild a DataFrame for one CSV file from its "rows" and "columns" fields
ex = orig["test"][0]
df = pd.DataFrame(ex["rows"], columns=ex["columns"])
```
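
If pandas is not available, a record can also be written back out as CSV text with the standard library. This is a minimal sketch that assumes each example carries a `columns` list and a `rows` list of lists, as the quick-start snippet suggests; the helper `record_to_csv` and the sample record are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def record_to_csv(example: dict) -> str:
    """Serialize one record ({"columns": [...], "rows": [[...], ...]}) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(example["columns"])   # header line
    writer.writerows(example["rows"])     # one line per table row
    return buffer.getvalue()


# Placeholder record with the assumed shape; the values are invented.
sample = {
    "columns": ["company", "year", "revenue"],
    "rows": [["Acme", 2023, 10], ["Acme", 2024, 12]],
}
csv_text = record_to_csv(sample)
print(csv_text)
```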
---
## 4 Dataset creation
* **WikiDB component**: Scraped via the method in Vogel et al. 2024 \[1].
* **SEC component**: Programmatic EDGAR queries (10-K/10-Q XML), converted to CSV.
* **Generation & planning**: Qwen-2-5-72B-Instruct with two-stage prompting: planning first, then report generation.
* **Filtering (ongoing)**: Unit-time accuracy threshold search (see paper §3.3).
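
The two-stage prompting in the generation step can be pictured as two chained model calls: one that produces a plan, and one that writes the report from the plan and the table. The sketch below shows only that control flow. The prompt templates, the function `two_stage_report`, and the stub model are all hypothetical; they are not the prompts or code used in the paper.

```python
from typing import Callable, Dict

# Placeholder prompt templates for illustration only; the paper's actual
# prompts are not reproduced here.
PLAN_TEMPLATE = (
    "You are given a table in CSV form.\n"
    "List the report types that could be written from it.\n\n"
    "{table}"
)
REPORT_TEMPLATE = (
    "Write a report from the table, following this plan.\n\n"
    "Plan:\n{plan}\n\nTable:\n{table}"
)


def two_stage_report(table_csv: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Stage 1 asks the model for a plan; stage 2 feeds the plan back in."""
    plan = llm(PLAN_TEMPLATE.format(table=table_csv))
    report = llm(REPORT_TEMPLATE.format(plan=plan, table=table_csv))
    return {"plan": plan, "report": report}


# Stub model that records its prompts, so the flow can be run offline.
calls = []


def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


result = two_stage_report("year,revenue\n2024,12\n", stub_llm)
print(result)
```

Passing the model as a plain callable keeps the sketch independent of any particular inference client; a real client for Qwen-2-5-72B-Instruct would replace `stub_llm`.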
---
## 5 Citation
```
@inproceedings{kashyap2025structtext,
title = {StructText: A Synthetic Table-to-Text Approach …},
author = {Satyananda Kashyap and Sola Shirai and
Nandana Mihindukulasooriya and Horst Samulowitz},
booktitle = {Proc.\ VLDB TaDA Workshop},
year = {2025},
note = {Accepted Oral}
}
```
**Sources**
1. Liane Vogel, Jan-Micha Bodensohn, Carsten Binnig.
*WikiDBs: A Large-Scale Corpus of Relational Databases from Wikidata.*
NeurIPS 2024 Datasets & Benchmarks Track.
2. *SEC EDGAR database.* [https://www.sec.gov/edgar](https://www.sec.gov/edgar)
Provider: maas
Created: 2025-10-12



