The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction - dataset

NIAID Data Ecosystem, indexed 2026-03-11

Download link:

https://zenodo.org/record/1212595

Resource description:

Brief description: The zip file contains two folders. The "websites" folder holds crawled web pages from real websites, such as agatameble.pl (an e-shop), filmweb.pl (a website about films), and ptaki.info (a website about birds). The "reference-seeds" folder contains three subfolders: agatameble.pl, filmweb.pl, and ptaki.info. Each subfolder contains a reference-seeds.csv file with the reference instances, that is, carefully labelled ground-truth values for each web page of the corresponding website.

Reference: I would appreciate it if you cite the following paper when using the dataset: Marcin Mirończuk, "The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction", Knowledge and Information Systems, Volume 54, Issue 3, pp. 711–776, 2018 (open-access PDF: http://rdcu.be/u88F, or DOI: http://dx.doi.org/10.1007/s10115-017-1097-2).
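The folder layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of the dataset: the function name `load_reference_seeds`, the comma delimiter, and the UTF-8 encoding are assumptions, and because the listing does not document the CSV columns, rows are returned as raw lists of strings.

```python
import csv
from pathlib import Path

# Site subfolders named in the dataset description.
SITES = ("agatameble.pl", "filmweb.pl", "ptaki.info")


def load_reference_seeds(root, delimiter=","):
    """Return {site: raw CSV rows} for each site subfolder found under root.

    `root` is the directory the zip file was extracted into (hypothetical
    path chosen by the user). The delimiter and encoding are assumptions;
    adjust them if the files use a different format.
    """
    seeds = {}
    for site in SITES:
        path = Path(root) / "reference-seeds" / site / "reference-seeds.csv"
        if not path.is_file():
            continue  # skip sites missing from this extraction
        with path.open(newline="", encoding="utf-8") as handle:
            seeds[site] = list(csv.reader(handle, delimiter=delimiter))
    return seeds
```

Calling `load_reference_seeds("path/to/extracted-zip")` would give one list of rows per website, which can then be compared against values extracted from the pages in the "websites" folder.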

Created:

2020-01-24