websrc
ModelScope community · updated 2025-12-05 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/rootsautomation/websrc
# Dataset Card for WebSRC v1.0
WebSRC v1.0 is a dataset for reading comprehension on structural web pages.
The task is to answer questions about web pages, which requires a system to understand both the spatial structure and the logical structure of each page.
WebSRC consists of 6.4K web pages and 400K question-answer pairs about web pages.
This cached copy of the dataset is focused on Q&A using the web screenshots (HTML and other metadata are omitted).
Questions in WebSRC were created for each segment.
Answers are either text spans from web pages or yes/no.
For more details, please refer to the paper [WebSRC: A Dataset for Web-Based Structural Reading Comprehension](https://arxiv.org/abs/2101.09465). The Leaderboard of WebSRC v1.0 can be found [here](https://x-lance.github.io/WebSRC/#).
The original [GitHub Repo](https://github.com/X-LANCE/WebSRC-Baseline/tree/master?tab=readme-ov-file) is also available.
This flat version of the dataset was specifically compiled to aid Large Multimodal Model (LMM) development, especially in digital domains that need to reason about screens.
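Because answers are either text spans or yes/no, it can be handy to separate the two kinds before analysis. The following is a minimal sketch, not part of the original tooling: the helper name `answer_kind` is hypothetical, and it assumes yes/no answers appear as the bare words "yes" or "no" in any casing, which should be checked against the actual data.

```python
def answer_kind(answer: str) -> str:
    """Return "yes/no" for boolean-style answers, otherwise "span"."""
    normalized = answer.strip().lower()
    return "yes/no" if normalized in {"yes", "no"} else "span"


# Hypothetical answers, made up for illustration only.
examples = ["Yes", "no", "2.0 L Turbo", " $18,990 "]
kinds = [answer_kind(a) for a in examples]
print(kinds)
```

Matching on the whole normalized string, rather than on a prefix, keeps span answers that merely start with "no" (for example "Noise level") in the span group.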
## Structure
- `domain`: str, broad category of the website
- `page_id`: str, unique ID for the particular page
- `question`: str, the question to answer
- `answer`: str, the actual answer
- `image`: str, a base64 encoded binary string of the image.
The `image` is converted back to a PIL.Image with:
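```python
import base64
import io

from PIL import Image  # provided by the Pillow package


def decode_base64_to_image(base64_string):
    # Decode the base64 text back into the raw image bytes.
    img_data = base64.b64decode(base64_string)
    # Wrap the bytes in a file-like object so Pillow can open them.
    img = Image.open(io.BytesIO(img_data))
    return img
```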
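Before decoding images in bulk, it may help to confirm that a record has the five string fields listed above and that its `image` payload decodes to recognizable image bytes. The sketch below uses only the standard library. The helper `check_record`, the sample values, and the `page_id` shown are all hypothetical, and the card does not state which image format the screenshots use, so the format check is an assumption.

```python
import base64

# Field names come from the Structure list; all are stored as str.
EXPECTED_FIELDS = ("domain", "page_id", "question", "answer", "image")

# Leading magic bytes for two common screenshot formats.
MAGIC = {b"\x89PNG\r\n\x1a\n": "png", b"\xff\xd8\xff": "jpeg"}


def check_record(record: dict) -> str:
    """Validate the field layout and return the sniffed image format."""
    for field in EXPECTED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"field {field!r} is missing or not a str")
    raw = base64.b64decode(record["image"])
    for magic, name in MAGIC.items():
        if raw.startswith(magic):
            return name
    return "unknown"


# Hypothetical record: the values are invented, and the "image" payload is
# only the 8-byte PNG signature rather than a real screenshot.
sample = {
    "domain": "auto",
    "page_id": "au0100001",
    "question": "What is the engine size?",
    "answer": "2.0 L",
    "image": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
}
print(check_record(sample))
```

A check like this fails fast on malformed rows without requiring Pillow, so it can run before the heavier decoding step.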
## Data Statistics
Questions are roughly divided into three categories: KV,
Comparison, and Table. The detailed definitions can be found in the original
[paper](https://arxiv.org/abs/2101.09465). The numbers of websites, webpages,
and QAs corresponding to the three categories are as follows:
Type | # Websites | # Webpages | # QAs
---- | ---------- | ---------- | -----
KV | 34 | 3,207 | 168,606
Comparison | 15 | 1,339 | 68,578
Table | 21 | 1,901 | 163,314
The statistics of the dataset splits are as follows:
Split | # Websites | # Webpages | # QAs
----- | ---------- | ---------- | -----
Train | 50 | 4,549 | 307,315
Dev | 10 | 913 | 52,826
Test | 10 | 985 | 40,357
Note: The test split is _not_ included in this upload. See the original repo for compiling the test set, and how to obtain scores for the test split via submission.
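As a quick consistency check on the two tables, the per-category and per-split rows should add up to the same totals. The sketch below copies the figures from the tables; the helper `totals` and the variable names are illustrative only.

```python
# (websites, webpages, QAs) copied from the two statistics tables.
by_type = {
    "KV": (34, 3207, 168606),
    "Comparison": (15, 1339, 68578),
    "Table": (21, 1901, 163314),
}
by_split = {
    "Train": (50, 4549, 307315),
    "Dev": (10, 913, 52826),
    "Test": (10, 985, 40357),
}


def totals(rows: dict) -> tuple:
    """Sum each column across the rows of one table."""
    return tuple(sum(column) for column in zip(*rows.values()))


type_totals = totals(by_type)
split_totals = totals(by_split)
assert type_totals == split_totals
print(type_totals)  # (70, 6447, 400498)

# The test split is not part of this upload, so only Train + Dev remain.
available_qas = by_split["Train"][2] + by_split["Dev"][2]
print(available_qas)  # 360141
```

Both tables give 70 websites, 6,447 webpages, and 400,498 QAs, which matches the rounded "6.4K web pages and 400K question-answer pairs" quoted in the introduction. With the test split excluded, the table figures imply 360,141 QAs across Train and Dev.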
## Reference
If you use any source code or datasets included in this repository in your work,
please cite the corresponding paper. The BibTeX entry is listed below:
```text
@inproceedings{chen-etal-2021-websrc,
title = "{W}eb{SRC}: A Dataset for Web-Based Structural Reading Comprehension",
author = "Chen, Xingyu and
Zhao, Zihan and
Chen, Lu and
Ji, JiaBao and
Zhang, Danyang and
Luo, Ao and
Xiong, Yuxuan and
Yu, Kai",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.343",
pages = "4173--4185",
abstract = "Web search is an essential way for humans to obtain information, but it{'}s still a great challenge for machines to understand the contents of web pages. In this paper, we introduce the task of web-based structural reading comprehension. Given a web page and a question about it, the task is to find an answer from the web page. This task requires a system not only to understand the semantics of texts but also the structure of the web page. Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs, which are collected from 6.4K web pages with corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various strong baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and baselines have been publicly available.",
}
```
## Dataset Compilation
Hunter Heidenreich; hunter (DOT) heidenreich _at_ rootsautomation *dot* com
Provider: maas
Created: 2025-10-14



