
Gutenberg Dataset

Source: Figshare. Updated 2023-11-30. Indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Gutenberg_Dataset/24574753
Description:
This dataset contains the images and labels for the Gutenberg data used in the CVPR NAS workshop Unseen-data challenge under the codename "Gutenberg". Because the name describes the content, we kept it as the official name.

The Gutenberg dataset is a constructed dataset containing phrases from famous literary works made available by Project Gutenberg (www.gutenberg.org), which provides free ebooks of literary works that are no longer under US copyright protection*.

The dataset was created from several works by six popular authors (see the label mapping below). The downloaded works were English translations chosen to represent a variety of cultures and time periods. We performed basic preprocessing on each text: removing punctuation, converting letters with diacritics to their base letter, and removing "structure" words (e.g., 'Chapter', 'Scene', 'Prologue'). We then extracted sequences of three consecutive words, each between 3 and 6 letters long. In each sequence, every word was padded with spaces to 6 characters, and the three padded words were concatenated to produce an 18-character string. These strings were used as the base for image creation.

Training, validation, and test sequences were chosen so that no sequence appears in more than one data split.

Each string is encoded as an image in the form of a graph. The x-axis represents each character's index in the 18-character string, and the y-axis represents the corresponding letter (the axis is arranged alphabetically A-Z, with the space character placed underneath the Z).

The data is in channels-first format with shape (n, 1, 27, 18), where n is the number of samples in the corresponding set (45,000 for training, 15,000 for validation, and 6,000 for testing).

There are six classes in the dataset, with 11,000 examples of each, distributed evenly across the three subsets. The six classes and their numerical labels are:

aquinas: 0, confucius: 1, hawthorne: 2, plato: 3, shakespeare: 4, tolstoy: 5

*These works fall into the public domain under US copyright law. Please check whether the works are available under your country's copyright laws.
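
The padding and encoding steps described above can be illustrated with a minimal sketch. It is not the authors' code, and several details are assumptions the description does not settle: words are padded on the right, row 0 corresponds to 'a' and row 26 to the space, and a marked cell holds 1 while all other cells hold 0. The function names `strip_diacritics`, `build_string`, and `encode_image` are hypothetical.

```python
import string
import unicodedata

# Row order is an assumption: 'a'..'z' first, then the space as the last row.
ALPHABET = string.ascii_lowercase + " "
WORD_WIDTH = 6

# Label mapping as given in the dataset description.
LABELS = {"aquinas": 0, "confucius": 1, "hawthorne": 2,
          "plato": 3, "shakespeare": 4, "tolstoy": 5}


def strip_diacritics(text):
    """Map letters with diacritics to their base letter, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_string(words):
    """Pad three words to 6 characters each and join them into 18 characters."""
    cleaned = [strip_diacritics(w).lower() for w in words]
    if len(cleaned) != 3 or not all(3 <= len(w) <= WORD_WIDTH for w in cleaned):
        raise ValueError("expected three words of 3 to 6 letters each")
    # Right-padding is an assumption; the description only says "padded with spaces".
    return "".join(w.ljust(WORD_WIDTH) for w in cleaned)


def encode_image(s):
    """Return a 27 x 18 grid (rows: characters, columns: positions), one 1 per column."""
    grid = [[0] * len(s) for _ in ALPHABET]
    for col, ch in enumerate(s):
        grid[ALPHABET.index(ch)][col] = 1
    return grid


phrase = build_string(["the", "quick", "café"])
image = encode_image(phrase)
```

Each grid produced this way has 27 rows and 18 columns, matching the last two dimensions of the stated (n, 1, 27, 18) shape. Adding the single channel axis and stacking samples is left to whichever array library is used.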
Created:
2023-11-30