
The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction - dataset

Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/1212595
Resource description:
Brief description
The zip file contains two folders. The "websites" folder includes crawled web pages from three real websites: agatameble.pl (an e-shop), filmweb.pl (a website about films), and ptaki.info (a website about birds). The "reference-seeds" folder contains three subfolders, one per website: agatameble.pl, filmweb.pl, and ptaki.info. Each subfolder contains a reference-seeds.csv file. This file holds the reference instances, i.e. the carefully labelled ground-truth values for each web page of the corresponding website.

Reference
I would appreciate it if you cite the following paper when using the dataset: Marcin Mirończuk, The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction, Knowledge and Information Systems, Volume 54, Issue 3, p. 711–776, 2018 (PDF, Open Access: http://rdcu.be/u88F or DOI http://dx.doi.org/10.1007/s10115-017-1097-2)
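
The folder layout in the brief description can be checked with a short script once the archive is extracted. The following is a minimal sketch, not part of the dataset: the function name `locate_reference_seeds` and the assumption that the "reference-seeds" folder sits directly under the extraction root are illustrative. The column layout of reference-seeds.csv is not described in this listing, so the sketch only locates the files and does not parse them.

```python
from pathlib import Path

# Site names as listed in the dataset description.
SITES = ("agatameble.pl", "filmweb.pl", "ptaki.info")


def locate_reference_seeds(root):
    """Return {site: path to its reference-seeds.csv} for files that exist.

    Assumes (hypothetically) that `root` is the directory the zip was
    extracted into and that it directly contains "reference-seeds".
    """
    root = Path(root)
    found = {}
    for site in SITES:
        candidate = root / "reference-seeds" / site / "reference-seeds.csv"
        if candidate.is_file():
            found[site] = candidate
    return found
```

Calling `locate_reference_seeds("path/to/extracted")` returns one entry per website whose ground-truth file is present, which makes a missing or misplaced subfolder easy to spot before any evaluation is run.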
Created:
2020-01-24