imperial-cpg/project-gutenberg-extended

Name: imperial-cpg/project-gutenberg-extended
Creator: imperial-cpg
Published: 2024-07-26 17:29:14
License: not specified on this listing (the dataset card below declares apache-2.0)

Source: Hugging Face. Updated 2024-07-26. Indexed 2025-04-26.

Download link:

https://hf-mirror.com/datasets/imperial-cpg/project-gutenberg-extended

Description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
size_categories:
- 1K<n<10K
---

# Books from Project Gutenberg (after PG-19)

This dataset contains 9,542 books collected from [Project Gutenberg](https://www.gutenberg.org/), an online library of free e-books. Specifically, we collect books that were added to Project Gutenberg after the last book included in the widely used [PG-19 dataset](https://huggingface.co/datasets/deepmind/pg19). Of all books included in PG-19, the latest release date on Project Gutenberg was February 10, 2019. We use an [open source library](https://github.com/kpully/gutenberg_scraper) to download all English books that were added to Project Gutenberg after this date (and adapt the code [here](https://github.com/computationalprivacy/document-level-membership-inference/tree/main/data/raw_gutenberg)). As preprocessing, we keep only the text between the explicit start and end markers of the uniformly formatted text files.

This data was collected as part of the experimental setup of the paper *"Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models"* ([link](https://arxiv.org/pdf/2310.15007)). The goal was to create a dataset of representative *non-member* documents, relative to PG-19, to develop and evaluate a Membership Inference Attack (MIA) against a Large Language Model (LLM) trained on data containing PG-19.

We release here the data we used to generate the results discussed in the paper, mainly to facilitate further research in similar directions. Importantly, later research beyond the study in the paper ([here](https://arxiv.org/pdf/2406.17975)) suggests that this dataset exhibits a serious distribution shift in language compared to the books in PG-19. Hence, we do not recommend using this data, at least not in its current form, as non-member data to develop and evaluate post-hoc MIAs against LLMs.

Of course, the dataset is also a rich source of natural language from literature (most of which should be in the public domain in the US) and could be used for other purposes.

If you found this dataset helpful for your work, kindly cite us as

```
@article{meeus2023did,
  title={Did the neurons read your book? document-level membership inference for large language models},
  author={Meeus, Matthieu and Jain, Shubham and Rei, Marek and de Montjoye, Yves-Alexandre},
  journal={arXiv preprint arXiv:2310.15007},
  year={2023}
}
```
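The preprocessing step described above (keeping only the text between the explicit start and end of each file) could look roughly like the following. This is a minimal sketch, not the authors' code: the function name `extract_body` is hypothetical, and the marker wording (`*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and its `END` counterpart) is an assumption about the file layout. The actual implementation lives in the repository linked above and may differ.

```python
import re
from typing import Optional

# Assumed marker layout: a line such as
#   *** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***
# and a matching END line. The exact wording is an assumption, not taken
# from the dataset card.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(raw_text: str) -> Optional[str]:
    """Return the text between the start and end markers, or None if either is missing."""
    start = START_RE.search(raw_text)
    if start is None:
        return None
    # Look for the end marker only after the start marker.
    end = END_RE.search(raw_text, start.end())
    if end is None:
        return None
    return raw_text[start.end():end.start()].strip()


if __name__ == "__main__":
    sample = (
        "Front matter and license header\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "\n"
        "Chapter 1\n"
        "Some text.\n"
        "\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "Trailing license text\n"
    )
    print(extract_body(sample))
```

Returning `None` when a marker is missing, rather than the whole file, makes it easy to count and inspect files that do not follow the assumed layout instead of silently keeping license boilerplate.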

Provided by:

imperial-cpg
