Web2Text: Deep Structured Boilerplate Removal
Source: DataCite Commons. Updated: 2024-12-16. Indexed: 2025-04-16.
Download link:
https://service.tib.eu/ldmservice/dataset/b07c7152-6d4e-470c-899e-50707cf0701d
Description:
Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content.
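The following is a minimal sketch of what "collectively" labeling a sequence of text blocks can look like, as opposed to classifying each block in isolation. It uses generic Viterbi decoding over per-block scores and label-transition scores. The description above does not say which decoding procedure or features the authors use, so the function names (`viterbi`, `label_blocks`) and all score values here are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

LABELS = ("boilerplate", "content")


def viterbi(unary: Sequence[Sequence[float]],
            pairwise: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index sequence.

    unary[i][k]    -- score for giving block i the label k
    pairwise[j][k] -- score for label j being followed by label k
    """
    if not unary:
        return []
    n_labels = len(unary[0])
    score = list(unary[0])   # best score of any path ending in each label
    back = []                # back-pointers, one list per block after the first
    for i in range(1, len(unary)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + pairwise[j][k])
            new_score.append(score[best_j] + pairwise[best_j][k] + unary[i][k])
            pointers.append(best_j)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


def label_blocks(unary, pairwise) -> List[str]:
    """Map the decoded label indices to readable label names."""
    return [LABELS[k] for k in viterbi(unary, pairwise)]


# Hypothetical scores for four consecutive text blocks
# (column 0 = boilerplate, column 1 = content).
unary_scores = [[2.0, 0.0], [0.0, 1.5], [0.4, 0.0], [0.0, 2.0]]
# Hypothetical transition scores that mildly favor neighbors sharing a label.
pairwise_scores = [[1.0, 0.0], [0.0, 1.0]]

print(label_blocks(unary_scores, pairwise_scores))
```

In this toy input the third block, scored on its own, leans slightly toward boilerplate, but joint decoding labels it as content because both of its neighbors are content. That is the kind of effect a sequence-labeling formulation is meant to capture.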
Provider:
TIB
Created:
2024-12-16



