Structured Web Data Extraction Dataset (SWDE)

Name: Structured Web Data Extraction Dataset (SWDE)
Creator: academictorrents.com
License: No description available

Indexed from academictorrents.com on 2025-01-21

Download link:

https://academictorrents.com/details/411576c7e80787e4b40452360f5f24acba9b5159


Description:

## Motivation

This dataset is a real-world web page collection used for research on the automatic extraction of structured data (e.g., attribute-value pairs of entities) from the Web. We hope it could serve as a useful benchmark for evaluating and comparing different methods for structured web data extraction.

## Contents of the Dataset

Currently the dataset involves:

- 8 verticals with diverse semantics;
- 80 web sites (10 per vertical);
- 124,291 web pages (200 ~ 2,000 per web site), each containing a single data record with detailed information of an entity;
- 32 attributes (3 ~ 5 per vertical) associated with carefully labeled ground-truth of corresponding values in each web page.

The goal of structured data extraction is to automatically identify the values of these attributes from web pages.

The involved verticals are summarized as follows:

|Vertical |#Sites|#Pages|#Attributes|Attributes|
|—————-|———|—————


Provider:

academictorrents.com

Summary

Background overview

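SWDE is a large-scale, real-world dataset for research on structured web data extraction. It covers 8 verticals (such as auto, book, and camera) and contains 124,291 web pages from 80 web sites. Each page holds a single record for one entity, with labeled ground truth for 32 attributes. The dataset is intended as a benchmark for evaluating and comparing structured data extraction methods, and it offers semantic diversity across verticals together with detailed DOM-node-level annotations.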
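To make the benchmarking use case concrete, the following is a minimal sketch of how an extractor's output might be scored against labeled ground truth. It is not the dataset's official evaluation procedure: the exact-match criterion, the per-attribute precision/recall/F1 metrics, the in-memory dictionary layout, and the page IDs and attribute names (`title`, `author`) are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Collapse whitespace and lowercase so trivial differences do not count as errors."""
    return " ".join(value.split()).lower()


def score(predicted: dict, truth: dict) -> dict:
    """Compute per-attribute precision, recall, and F1 using exact match.

    Both arguments map a page ID to {attribute name: list of values}.
    This layout is hypothetical, not the dataset's on-disk format.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for page_id in set(predicted) | set(truth):
        pred_attrs = predicted.get(page_id, {})
        true_attrs = truth.get(page_id, {})
        for attr in set(pred_attrs) | set(true_attrs):
            p = {normalize(v) for v in pred_attrs.get(attr, [])}
            t = {normalize(v) for v in true_attrs.get(attr, [])}
            counts[attr]["tp"] += len(p & t)  # extracted and correct
            counts[attr]["fp"] += len(p - t)  # extracted but not in ground truth
            counts[attr]["fn"] += len(t - p)  # in ground truth but missed

    results = {}
    for attr, c in counts.items():
        extracted = c["tp"] + c["fp"]
        expected = c["tp"] + c["fn"]
        precision = c["tp"] / extracted if extracted else 0.0
        recall = c["tp"] / expected if expected else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[attr] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Hypothetical pages and values, for illustration only.
truth = {
    "page-0001": {"title": ["Example Book"], "author": ["A. Writer"]},
    "page-0002": {"title": ["Another Book"], "author": ["B. Author"]},
}
predicted = {
    "page-0001": {"title": ["example  book"], "author": ["A. Writer"]},
    "page-0002": {"title": ["Wrong Title"], "author": []},
}

results = score(predicted, truth)
for attr in sorted(results):
    print(attr, results[attr])
```

Scoring each attribute separately keeps a weak attribute from being hidden by a strong one. Averaging the per-attribute scores within a site or vertical is one possible next step, though the right aggregation depends on the evaluation protocol being followed.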

Background overview collected and summarized by 遇见数据集.
