Team-PIXEL/rendered-wikipedia-english

Name: Team-PIXEL/rendered-wikipedia-english
Creator: Team-PIXEL
Published: 2022-08-02 14:01:21
License: CC-BY-SA-3.0, GFDL

Source: Hugging Face. Updated: 2022-08-02. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Team-PIXEL/rendered-wikipedia-english

Resource summary:

This dataset contains the complete English Wikipedia as of February 1, 2018, rendered into images at 16×8464 resolution. The original text dataset comes from the Wikipedia dumps, where each example is a full Wikipedia article that has been cleaned to remove markup and unwanted sections such as references. The rendered dataset contains 11.4 million examples, totaling approximately 2 billion words, and is stored as 338 Parquet files. It was used to train the PIXEL model, introduced in the paper *Language Modelling with Pixels*. Each example consists of a 16×8464 grayscale image and an integer giving the number of image patches that actually contain text.

Provided by:

Team-PIXEL

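Original information summary

Dataset overview

Basic information

Dataset name: Team-PIXEL/rendered-wikipedia-english
Language: English (en)
License: CC-BY-SA-3.0, GFDL
Multilinguality: monolingual
Size category: 10M<n<100M
Source datasets: original
Task categories: masked auto-encoding, rendered language modelling
Task IDs: masked auto-encoding, rendered language modelling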

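Dataset description

Dataset content: the complete English Wikipedia as of February 1, 2018, rendered into images at 16x8464 resolution.
Original data source: Wikipedia dumps
Data processing: the original text data was cleaned to remove markup and unwanted sections (such as references). Each rendered example contains part of one article.
Dataset purpose: used to train the PIXEL model, introduced in the paper "Language Modelling with Pixels".
Dataset composition: 338 parquet files containing about 2 billion words across 11.4 million examples.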

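Dataset structure

Data instances: each instance contains pixel_values (a grayscale image at 16x8464 resolution) and num_patches (the number of patches in the image that contain actual text).
Data fields:
- pixel_values: an image feature
- num_patches: an integer feature
Data splits: a training split only, with 11446535 instances.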
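
To make the relationship between the two fields concrete, the following is a minimal sketch of how num_patches could be turned into a per-patch mask. It assumes square 16x16 patches (so that a 16x8464 image holds 8464 / 16 = 529 patches) and assumes that text-bearing patches occupy the leftmost positions; neither assumption is stated on this page, and the function names are hypothetical.

```python
IMAGE_HEIGHT = 16
IMAGE_WIDTH = 8464
PATCH_SIZE = 16  # assumption: square patches as tall as the image


def max_patches(width=IMAGE_WIDTH, patch_size=PATCH_SIZE):
    """Number of non-overlapping patches that fit along the image width."""
    if width % patch_size != 0:
        raise ValueError("image width must be a multiple of the patch size")
    return width // patch_size


def text_patch_mask(num_patches, total=None):
    """Return a list with 1 for text-bearing patches and 0 for padding.

    Assumes (hypothetically) that text patches come first, left to right.
    """
    total = max_patches() if total is None else total
    if not 0 <= num_patches <= total:
        raise ValueError("num_patches must lie between 0 and the patch count")
    return [1] * num_patches + [0] * (total - num_patches)


if __name__ == "__main__":
    print(max_patches())               # patches per 16x8464 image
    print(sum(text_patch_mask(100)))   # patches flagged as containing text
```

A mask of this kind is one plausible way to keep a model from attending to the blank padding at the end of an image.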

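Dataset creation

Rendering tool: rendered with the PyGame backend and merged Google Noto Sans fonts.
Rendering limitations: complex text layouts (such as ligatures and right-to-left scripts) and emoji are not supported, so such text in the Wikipedia data may be rendered inaccurately.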
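
As an illustration of the right-to-left limitation, here is a small heuristic that flags text containing right-to-left characters, which could help identify source passages whose rendering may be inaccurate. It is only a sketch using Python's standard unicodedata module, it is not part of the dataset's tooling, the function name is hypothetical, and it does not attempt to detect ligatures or emoji.

```python
import unicodedata

# Unicode bidirectional classes for right-to-left characters:
# "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_BIDI_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in `text` is a right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES for ch in text)


if __name__ == "__main__":
    for sample in ["Language Modelling with Pixels", "שלום", "مرحبا"]:
        print(repr(sample), contains_rtl(sample))
```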

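Usage guide

Loading the dataset: load the dataset with the datasets library, either by downloading it locally or by streaming it directly from the dataset hub.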
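
A minimal sketch of the loading step, assuming the Hugging Face datasets package is installed and the hub is reachable. The helper name load_rendered_wikipedia is hypothetical; load_dataset with split and streaming arguments is the library's standard entry point. Streaming is the default here because the full download is large.

```python
DATASET_ID = "Team-PIXEL/rendered-wikipedia-english"


def load_rendered_wikipedia(streaming=True):
    """Load the train split, streamed from the hub or downloaded locally.

    The import is done lazily so this module can be imported even when
    the `datasets` package is not installed.
    """
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train", streaming=streaming)


if __name__ == "__main__":
    dataset = load_rendered_wikipedia(streaming=True)
    example = next(iter(dataset))
    # Field names taken from the dataset structure described on this page.
    print(example["num_patches"])
    print(type(example["pixel_values"]))
```

Passing streaming=False instead downloads the parquet files to the local cache before returning the dataset.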

License information

Text and images: most content is released under CC BY-SA 3.0 and the GFDL.
Some text: released under CC BY-SA only and cannot be reused under the GFDL.

Citation information

@article{rust-etal-2022-pixel,
  title={Language Modelling with Pixels},
  author={Phillip Rust and Jonas F. Lotz and Emanuele Bugliarello and Elizabeth Salesky and Miryam de Lhoneux and Desmond Elliott},
  journal={arXiv preprint},
  year={2022},
  url={https://arxiv.org/abs/2207.06991}
}

Contact

Added by: Phillip Rust
GitHub: @xplip
Twitter: @rust_phillip

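Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset is a collection of rendered images of the 2018 English Wikipedia at a resolution of 16x8464, containing 11.4M examples (about 2 billion words), and is used to train the PIXEL model. The data is stored as 338 parquet files with a total size of 126 GB.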
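
The figures above imply some rough per-file and per-example averages, worked out in the short sketch below. It uses the rounded numbers quoted on this page and assumes "GB" means decimal gigabytes (10^9 bytes), so the results are approximate; the variable names are illustrative only.

```python
NUM_EXAMPLES = 11_400_000        # "11.4M examples" (rounded figure)
NUM_WORDS = 2_000_000_000        # "about 2 billion words"
NUM_FILES = 338                  # parquet files
TOTAL_BYTES = 126 * 10**9        # assumption: 126 GB in decimal gigabytes

examples_per_file = NUM_EXAMPLES / NUM_FILES
words_per_example = NUM_WORDS / NUM_EXAMPLES
bytes_per_example = TOTAL_BYTES / NUM_EXAMPLES
megabytes_per_file = TOTAL_BYTES / NUM_FILES / 10**6

if __name__ == "__main__":
    print(f"examples per file:  {examples_per_file:,.0f}")
    print(f"words per example:  {words_per_example:.0f}")
    print(f"bytes per example:  {bytes_per_example:,.0f}")
    print(f"megabytes per file: {megabytes_per_file:.0f}")
```

Under these assumptions, each example averages roughly 175 words and about 11 KB on disk, and each parquet file is roughly 373 MB.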

The content above was collected and summarized by 遇见数据集.
