andreaparker/wiki-ss-corpus-train-sm-subset
Source: Hugging Face. Updated: 2024-12-06. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/andreaparker/wiki-ss-corpus-train-sm-subset
Description:
This dataset is a 1,000-record subset of the 360 GB `wiki-ss-corpus` (Wiki Screenshot corpus) dataset. It consists of screenshots scraped from Wikipedia pages, curated to ensure dataset quality, with a few metadata fields such as the document ID attached to each record. The features are `image`, `docid`, `text`, and `title`. There is a single training split of 1,000 examples, with a download size of 317,919,193 bytes and a dataset size of 318,879,929 bytes.
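As a rough illustration of the schema and sizes described above, the following is a minimal Python sketch rather than an official loader. The helper names (`mib`, `missing_features`, `load_subset`) are hypothetical, and the repository id is assumed to match the page title. `load_subset` assumes the `datasets` package is installed and that network access is available; the other helpers run offline.

```python
# Figures quoted in the description; names below are illustrative only.
NUM_EXAMPLES = 1000
DOWNLOAD_BYTES = 317_919_193
DATASET_BYTES = 318_879_929
EXPECTED_FEATURES = ("image", "docid", "text", "title")


def mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


def missing_features(record: dict) -> list:
    """Return the expected feature names that are absent from a record."""
    return [name for name in EXPECTED_FEATURES if name not in record]


def load_subset():
    """Load the train split from the Hugging Face Hub (needs network access).

    Assumption: the repository id is the one shown in the page title.
    """
    # Imported lazily so the offline helpers above work without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "andreaparker/wiki-ss-corpus-train-sm-subset", split="train"
    )
```

A typical use would be to call `load_subset()`, take the first record, and pass it to `missing_features` to confirm that all four fields are present before building anything on top of the data. Dividing `DATASET_BYTES` by `NUM_EXAMPLES` gives a rough average size per record, most of which is presumably the screenshot image.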
Provider:
andreaparker



