The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction - dataset
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/1212595
Resource description:
Brief description
The zip file contains two folders. The "websites" folder includes crawled web pages from real websites, such as agatameble.pl (an e-shop website), filmweb.pl (a website about films), and ptaki.info (a website about birds). The "reference-seeds" folder contains three subfolders: agatameble.pl, filmweb.pl, and ptaki.info. Each subfolder contains a reference-seeds.csv file with the reference instances, i.e. carefully labelled ground-truth values for each web page of the corresponding website.
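As a rough illustration of the layout described above, the following Python sketch collects the rows of each reference-seeds.csv file per website. The function name load_reference_seeds, the UTF-8 encoding, and the comma delimiter are assumptions made for this example; the description does not specify the CSV columns, so the rows are returned as plain lists of strings.

```python
import csv
from pathlib import Path


def load_reference_seeds(root):
    """Map each site folder under <root>/reference-seeds to its CSV rows.

    `root` is the directory the zip file was extracted into (hypothetical;
    choose whatever path you unpacked the archive to).
    """
    seeds_dir = Path(root) / "reference-seeds"
    rows_by_site = {}
    for site_dir in sorted(p for p in seeds_dir.iterdir() if p.is_dir()):
        csv_path = site_dir / "reference-seeds.csv"
        if not csv_path.is_file():
            continue  # skip folders that lack the expected file
        # Assumption: UTF-8 text with the default comma delimiter.
        # Inspect the real file and adjust encoding/delimiter if needed.
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows_by_site[site_dir.name] = list(csv.reader(handle))
    return rows_by_site
```

Calling load_reference_seeds on the extraction directory should give a dictionary keyed by agatameble.pl, filmweb.pl, and ptaki.info, which can then be matched against the pages in the "websites" folder.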
Reference
I would appreciate it if you cite the following paper when using the dataset:
Marcin Mirończuk, The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction, Knowledge and Information Systems, Volume 54, Issue 3, pp. 711–776, 2018 (open-access PDF: http://rdcu.be/u88F, or DOI: http://dx.doi.org/10.1007/s10115-017-1097-2)
Created:
2020-01-24



