andreaparker/wiki-ss-corpus-train-sm-subset

Name: andreaparker/wiki-ss-corpus-train-sm-subset
Creator: andreaparker
Published: 2024-12-06 17:35:34
License: No description available

Source: Hugging Face; updated 2024-12-06; indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/andreaparker/wiki-ss-corpus-train-sm-subset


Description:

This dataset is a 1,000-record subset of the `wiki-ss-corpus` (Wiki Screenshot corpus) dataset, which is about 360 GB in full. It consists of scraped screenshots of Wikipedia pages that were curated to ensure dataset quality, with a few metadata fields, such as the document ID, attached to each record. The features are `image`, `docid`, `text`, and `title`. There is a single training split of 1,000 examples, with a download size of 317,919,193 bytes and a dataset size of 318,879,929 bytes.
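As a rough illustration of how the split and features described here might be used, the following sketch loads the training split with the Hugging Face `datasets` library and checks a record against the listed feature names. The helper names (`load_subset`, `missing_features`, `bytes_to_mib`) are hypothetical and not part of the dataset or any library. The sketch assumes `datasets` is installed, the repository is publicly reachable, and the split is named `train`; this listing does not state the column types, so they are not checked.

```python
from typing import Any, Dict, List

# Feature names as listed in the description; the column types are not
# stated in this listing, so nothing about them is assumed here.
EXPECTED_FEATURES = ("image", "docid", "text", "title")
REPO_ID = "andreaparker/wiki-ss-corpus-train-sm-subset"


def load_subset():
    """Download the train split (needs `pip install datasets` and network access)."""
    # Imported lazily so the helpers below can be used without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


def missing_features(record: Dict[str, Any]) -> List[str]:
    """Return the expected feature names that are absent from a record."""
    return [name for name in EXPECTED_FEATURES if name not in record]


def bytes_to_mib(n_bytes: float) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)


if __name__ == "__main__":
    ds = load_subset()
    print("examples:", len(ds))  # the description states 1000
    print("missing features:", missing_features(ds[0]))
    print(f"download size: {bytes_to_mib(317919193):.1f} MiB")
```

The download only happens when the script is run directly, so the two small helpers can be reused or tested offline.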

Provider:

andreaparker
