Tanushreeeeee/pg19
Source: Hugging Face · Updated: 2025-12-16 · Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/Tanushreeeeee/pg19
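Description:
PG-19 is a language modeling benchmark dataset containing books published before 1919, extracted from the Project Gutenberg library. The dataset includes book titles, publication dates, and text content. PG-19 is more than twice the size of the Billion Word benchmark, and its documents are on average 20 times longer than those in the WikiText long-range language modeling benchmark. The dataset is split into training, validation, and test sets and is released as an open-vocabulary benchmark with no limit on vocabulary size. It is suited to benchmarking long-range language models, or to pre-training for other natural language processing tasks that require long-range reasoning.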
This repository contains the PG-19 language modeling benchmark. It includes a set of books, extracted from the Project Gutenberg books library, that were published before 1919. It also contains metadata of book titles and publication dates. PG-19 is over double the size of the Billion Word benchmark and contains documents that are 20X longer, on average, than those in the WikiText long-range language modelling benchmark. Books are partitioned into train, validation, and test sets. Book metadata is stored in metadata.csv, which contains (book_id, short_book_title, publication_date).

Unlike prior benchmarks, we do not constrain the vocabulary size (i.e. by mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing applied to the text is the removal of boilerplate license text and the mapping of offensive discriminatory words, as specified by Ofcom, to placeholder tokens. Users are free to model the data at the character level, at the subword level, or via any mechanism that can model an arbitrary string of text. To compare models, we propose to continue measuring word-level perplexity: the total likelihood of the dataset (computed via any chosen subword vocabulary or character-based scheme) divided by the number of tokens, as specified in the dataset statistics table.

One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks that require long-range reasoning, such as LAMBADA or NarrativeQA. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.
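The word-level perplexity described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official evaluation script: the function name `word_level_perplexity` and its parameters are hypothetical, the log-probabilities are assumed to be natural logarithms produced by whatever tokenization the model uses, and `num_words` is assumed to be the word-level token count taken from the dataset statistics table (not reproduced on this page).

```python
import math


def word_level_perplexity(token_log_probs, num_words):
    """Word-level perplexity from per-token log-probabilities.

    token_log_probs: natural-log probabilities the model assigned to each
        of its own tokens (characters, subwords, ...) over the whole split.
    num_words: number of word-level tokens in that split. Assumption: this
        value comes from the dataset statistics table, not from the model's
        tokenizer.
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    # Total negative log-likelihood of the split under the model.
    total_nll = -sum(token_log_probs)
    # Normalize by the word count, not by the model's own token count,
    # so that models with different tokenizations are comparable.
    return math.exp(total_nll / num_words)


# Toy example: 4 subword tokens, each with probability 0.5, covering 2 words.
toy_ppl = word_level_perplexity([math.log(0.5)] * 4, num_words=2)
```

Dividing by the word count rather than the model's token count is what makes a character-level model and a subword-level model comparable: both are charged for the same total likelihood over the same number of words.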
Provided by:
Tanushreeeeee



