OmAlve/reading-steiner-data
收藏Hugging Face2026-04-27 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/OmAlve/reading-steiner-data
下载链接
链接失效反馈官方服务:
资源简介:
Reading Steiner数据集是一个用于网页内容提取任务的数据集,旨在训练模型识别网页中的相关内容块,过滤掉导航、广告、侧边栏等样板内容。数据集包含51,697个训练样本和1,417个评估样本,覆盖9,256多个独特域名。数据集支持两种提取任务:主内容提取和基于查询的提取。数据格式为ChatML格式的对话,每个样本包含系统、用户和助手的消息。数据来源包括Wikipedia、实时网页抓取和FineWeb数据集,覆盖新闻、科技、科学、食品、健康、体育、金融、政府、旅行、论坛、文档等多个领域。数据集还提供了详细的内容长度分布和领域类别信息,以及常见的样板模式。
The Reading Steiner dataset is designed for web content extraction tasks, aiming to train models to identify relevant content blocks in web pages while filtering out boilerplate such as navigation, ads, and sidebars. The dataset includes 51,697 training examples and 1,417 evaluation examples, covering over 9,256 unique domains. It supports two extraction tasks: main content extraction and query-based extraction. The data is formatted as ChatML conversations, with each example containing messages from the system, user, and assistant. Data sources include Wikipedia, live web scraping, and the FineWeb dataset, spanning diverse domains such as news, technology, science, food, health, sports, finance, government, travel, forums, and documentation. The dataset also provides detailed content length distributions, domain categories, and common boilerplate patterns.
提供机构:
OmAlve



