Multilingual datasets for Main content extraction from web pages
Source: DataCite Commons | Updated: 2022-06-23 | Indexed: 2025-04-16
Download link:
https://ieee-dataport.org/documents/multilingual-datasets-main-content-extraction-web-pages
Description:
This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.
The dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
Related Resources:
- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework
- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2
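
The description says the crawled pages are stored as MHTML. Because MHTML is a MIME multipart format, the HTML of a page can in principle be recovered with Python's standard `email` package. The following is a minimal sketch under stated assumptions, not part of the dataset or the linked framework: the function name `extract_html_from_mhtml` and the sample bytes are hypothetical, and how the MHTML files are laid out inside the MongoDB archive or PostgreSQL dump is not specified here.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML (MIME multipart) document."""
    message = email.message_from_bytes(raw, policy=policy.default)
    # walk() visits the container and every nested MIME part in order.
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


# Hypothetical, hand-written sample standing in for one crawled page;
# real files from the dataset will be much larger and carry more parts.
SAMPLE_MHTML = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: http://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html = extract_html_from_mhtml(SAMPLE_MHTML)
```

The recovered HTML string could then be passed to whichever main content extractor is being evaluated. Images, stylesheets, and other resources remain available as the other parts of the same message.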
Provider:
IEEE DataPort
Created:
2022-06-23



