TECO (TEmplate detection and COntent extraction benchmarks suite)

Name: TECO (TEmplate detection and COntent extraction benchmarks suite)
Creator: Department of Computer Systems and Computing, Universitat Politècnica de València
Published: 2022-07-17 09:22:34
License: No description available

Source: arXiv. Updated 2022-07-17. Indexed 2024-06-21.

Download link:

http://www.dsic.upv.es/~jsilva/retrieval/teco

Overview:

TECO is a benchmark suite for template detection and content extraction, created by the Department of Computer Systems and Computing, Universitat Politècnica de València. It contains 150 real, heterogeneous websites collected from the Internet, including blogs, company sites, forums, and other types. The HTML elements of the pages were labeled manually to separate template from main content, and the suite includes automated scripts that support the benchmarking process. TECO is used mainly in web page analysis, in particular for template detection, content extraction, and menu detection, with the aim of making the processing and storage of web content more efficient and improving information retrieval.
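
This page does not document TECO's annotation format or its scripts, so the following is only a minimal sketch of how a benchmark of this kind is commonly scored. It assumes that, for one page, the manually labeled main-content elements and the elements returned by a tool are both available as sets of node identifiers. The identifiers and the function name `score_page` are hypothetical and are not part of TECO.

```python
def score_page(gold_content, detected_content):
    """Return (precision, recall, f1) for one page.

    Both arguments are iterables of node identifiers (hypothetical format):
    gold_content     -- elements manually labeled as main content
    detected_content -- elements a tool reported as main content
    """
    gold = set(gold_content)
    detected = set(detected_content)
    true_positives = len(gold & detected)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only: three of the four detected nodes are correct.
gold = {"n3", "n4", "n5", "n6"}
detected = {"n4", "n5", "n6", "n9"}
precision, recall, f1 = score_page(gold, detected)
print(precision, recall, f1)
```

Averaging these per-page scores over all pages of a site, and then over all sites, would give one overall figure per tool. Whether TECO's own scripts aggregate results this way is not stated here.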

Provided by:

Department of Computer Systems and Computing, Universitat Politècnica de València

Created:

2014-09-22
