TabbyXL: Experiment Data

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/448jdx7gcr
Description:
The data are designed to evaluate TabbyXL, a system for rule-based transformation of spreadsheet data from arbitrary tables to relational tables. TabbyXL is freely available on GitHub (https://github.com/cellsrg/tabbyxl).

Our data are based on the existing table dataset Troy_200 [1], which contains 200 arbitrary tables as CSV files collected from 10 different government statistical websites. The tables were collected for the experiment on data extraction from tables presented in [2]. We use an earlier version of the dataset that stores the original tables with style features (fonts, alignment, and indentation) as Excel spreadsheets, available at http://tango.byu.edu/data.

We have put all of these tables, with their style features, into a single spreadsheet file (data/TangoDataset.xlsx). Each of the 200 tables is located on a separate sheet, and a pair of tags, $START and $END, marks its location within the sheet. We first used this file in our previous experiment, described in [3].

We automatically transformed all tables of this spreadsheet into relational form using TabbyXL and the ruleset data/rules.dslr. The folder data/results contains the results. The folder data/gt contains the ground-truth data for automated performance evaluation of TabbyXL at the role and structural stages of table analysis.

Each table in data/results and data/gt is accompanied by two recordsets, ENTRIES and LABELS. In ENTRIES, each record presents an entry as a triple <value, provenance, set of associated labels>. In LABELS, each record presents a label as a triple <value, provenance, parent reference>.

We also store two log files: results.log, with the results of the run, and eval.log, with the results of the performance evaluation of TabbyXL.

REFERENCES
[1] Nagy G. TANGO-DocLab web tables from international statistical sites (Troy_200), 1, ID: Troy_200_1. URL: http://tc11.cvc.uab.es/datasets/Troy_200_1.
[2] Embley D., Krishnamoorthy M., Nagy G., & Seth S. (2016). Converting heterogeneous statistical tables on the web to searchable databases. Int. J. on Document Analysis and Recognition, 19(2), 119-138. URL: https://link.springer.com/article/10.1007/s10032-016-0259-1.
[3] Shigarov A., Paramonov V., Belykh P., & Bondarev A. (2016). Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets. Proc. 22nd Int. Conf. on Information and Software Technologies, pp. 78-91. URL: http://link.springer.com/chapter/10.1007/978-3-319-46254-7_7.
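
The ENTRIES and LABELS triples described above could be modeled roughly as follows. This is only an illustrative sketch, not TabbyXL code: the class names, the `label_path` helper, the sample values, and the choice of a cell reference string (such as "B1") as the provenance and as the parent reference are all assumptions made for the example. The actual file layout in data/results and data/gt may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Label:
    """A LABELS record: <value, provenance, parent reference>."""
    value: str
    provenance: str                # assumed: a cell reference such as "A1"
    parent: Optional[str] = None   # assumed: provenance of the parent label


@dataclass(frozen=True)
class Entry:
    """An ENTRIES record: <value, provenance, set of associated labels>."""
    value: str
    provenance: str
    # assumed: labels are referenced by their provenance
    labels: FrozenSet[str] = field(default_factory=frozenset)


def label_path(label: Label, by_provenance: Dict[str, Label]) -> List[str]:
    """Return label values from the topmost ancestor down to `label`."""
    path: List[str] = []
    seen = set()  # guards against accidental cycles in parent references
    current: Optional[Label] = label
    while current is not None and current.provenance not in seen:
        seen.add(current.provenance)
        path.append(current.value)
        current = by_provenance.get(current.parent) if current.parent else None
    return list(reversed(path))


# Hypothetical sample data, not taken from the dataset.
labels = [Label("Year", "A1"), Label("2010", "B1", parent="A1")]
index = {lab.provenance: lab for lab in labels}
entry = Entry("42", "B2", frozenset({"B1"}))

# Resolve each label attached to the entry into its full path.
paths = [label_path(index[ref], index) for ref in sorted(entry.labels)]
```

Resolving the parent references this way turns each entry into a flat row (value plus fully qualified labels), which is the kind of relational form the description refers to.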
Created:
2017-06-25