
struct-text

ModelScope community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/ibm-research/struct-text
Description:
# StructText: SEC_WikiDB & SEC_WikiDB_subset

*Dataset card for the VLDB 2025 TaDA-workshop submission “StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation” (under review).*

```python
from datasets import load_dataset

# default = SEC_WikiDB_unfiltered_all
ds = load_dataset(
    "ibm-research/struct-text",
    trust_remote_code=True)

# a specific configuration
subset = load_dataset(
    "ibm-research/struct-text",
    "SEC_WikiDB_subset_unfiltered_planned",
    trust_remote_code=True)
```

---

## 1 Dataset at a glance

| Family                | Size (CSV files) | Split sizes (train/dev/test) | Notes                                                 |
| --------------------- | ---------------- | ---------------------------- | ----------------------------------------------------- |
| **SEC_WikiDB**        | ≈ 1 000          | 80 % / 10 % / 10 %           | Parsed from EDGAR 10-K / 10-Q filings + WikiDB tables |
| **SEC_WikiDB_subset** | 49               | 39 / 5 / 5                   | Handy subset used in the paper                        |

Each split contains three *file types*:

| Suffix                      | Meaning                                                 |
| --------------------------- | ------------------------------------------------------- |
| `_original.csv`             | Raw structured data (columns + rows)                    |
| `*_generated_reports_*.csv` | Text generated from the table via Qwen-2-5-72B-Instruct |
| `*_report_types_*.csv`      | Reference text produced by our planning module          |

---

## 2 Folder layout

```
SEC_WikiDB/
├─ unfiltered/
│   ├─ train/   *_original.csv
│   │           *_generated.csv
│   │           *_planned.csv
│   ├─ dev/     …
│   └─ test/    …
└─ filtered/    # <- coming soon

SEC_WikiDB_subset/
├─ unfiltered/
│   ├─ train/   *_original.csv
│   │           *_generated.csv
│   │           *_planned.csv
│   ├─ dev/     …
│   └─ test/    …
└─ filtered/    # <- coming soon
```

The **loader** treats `<family>_<filtered|unfiltered>_<all|original|generated|planned>` as *configuration names*, e.g. `SEC_WikiDB_filtered_generated`.
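The naming pattern for configurations can be enumerated and parsed with a few lines of standard-library Python. This is only a sketch of the pattern as written in this card: the constants `FAMILIES`, `FILTER_MODES`, `FILE_TYPES` and the helpers `config_names` and `parse_config_name` are hypothetical names introduced here, not part of the dataset's loader, and the loader may not actually register every combination (the `filtered` variants are marked as coming soon).

```python
from itertools import product

# Name components as listed in this card; the loader's real registry may differ.
FAMILIES = ("SEC_WikiDB", "SEC_WikiDB_subset")
FILTER_MODES = ("unfiltered", "filtered")
FILE_TYPES = ("all", "original", "generated", "planned")


def config_names():
    """Return every name matching <family>_<filter mode>_<file type>."""
    return [
        f"{family}_{mode}_{kind}"
        for family, mode, kind in product(FAMILIES, FILTER_MODES, FILE_TYPES)
    ]


def parse_config_name(name):
    """Split a configuration name into (family, filter mode, file type)."""
    # Split from the right, because family names themselves contain underscores.
    family, mode, kind = name.rsplit("_", 2)
    if family not in FAMILIES or mode not in FILTER_MODES or kind not in FILE_TYPES:
        raise ValueError(f"unrecognized configuration name: {name!r}")
    return family, mode, kind


if __name__ == "__main__":
    for name in config_names():
        print(name, "->", parse_config_name(name))
```

Splitting from the right keeps `SEC_WikiDB_subset` intact as a family name, which a plain left-to-right split on underscores would break apart.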
---

## 3 Quick-start examples

```python
from datasets import load_dataset
import pandas as pd

# full corpus, but original tables only
orig = load_dataset(
    "ibm-research/struct-text",
    "SEC_WikiDB_unfiltered_original",
    trust_remote_code=True)

# rebuild a DataFrame for one CSV file
ex = orig["test"][0]
df = pd.DataFrame(ex["rows"], columns=ex["columns"])
```

---

## 4 Dataset creation

* **WikiDB component**: Scraped via the method in Vogel et al. 2024 [1].
* **SEC component**: Programmatic EDGAR queries (10-K/10-Q XML) → CSV.
* **Generation & planning**: Qwen-2-5-72B-Instruct with two-stage prompting: planning first, then report generation.
* **Filtering (ongoing)**: Unit-time accuracy threshold search (see paper §3.3).

---

## 5 Citation

```
@inproceedings{kashyap2025structtext,
  title     = {StructText: A Synthetic Table-to-Text Approach …},
  author    = {Satyananda Kashyap and Sola Shirai and Nandana Mihindukulasooriya and Horst Samulowitz},
  booktitle = {Proc.\ VLDB TaDA Workshop},
  year      = {2025},
  note      = {Accepted Oral}
}
```

**Sources**

1. Liane Vogel, Jan-Micha Bodensohn, Carsten Binnig. *WikiDBs: A Large-Scale Corpus of Relational Databases from Wikidata.* NeurIPS 2024 Datasets & Benchmarks Track.
2. *SEC EDGAR database.* [https://www.sec.gov/edgar](https://www.sec.gov/edgar)

Provider:
maas
Created:
2025-10-12