imperial-cpg/pile_arxiv_doc_mia_sequences

Name: imperial-cpg/pile_arxiv_doc_mia_sequences
Creator: imperial-cpg
Published: 2024-10-07 18:31:01
License: not specified

Source: Hugging Face. Updated: 2024-10-07. Indexed: 2025-04-26.

Download link:

https://hf-mirror.com/datasets/imperial-cpg/pile_arxiv_doc_mia_sequences

Description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  - name: doc_idx
    dtype: string
  splits:
  - name: train
    num_bytes: 76605258
    num_examples: 50000
  download_size: 36641532
  dataset_size: 76605258
---

# ArXiv papers from The Pile for document-level MIAs against LLMs (split into sequences)

This dataset contains **sequences from** ArXiv papers randomly sampled from the train (members) and test (non-members) splits of (the uncopyrighted version of) [the Pile](https://huggingface.co/datasets/monology/pile-uncopyrighted). It is intended for document-level membership inference attacks (MIAs). We randomly sample 1,000 member documents and 1,000 non-member documents, ensuring that each selected document has at least 5,000 words (a word being any sequence of characters separated by whitespace).

This dataset contains the first 25 sequences of 200 words from each of the documents made available in full [here](https://huggingface.co/datasets/imperial-cpg/pile_arxiv_doc_mia), which gives 2,000 × 25 = 50,000 sequences in total.

The dataset contains the following columns:

- text: the raw text of the sequence
- label: binary label for membership (1=member)
- doc_idx: index that allows sequences to be grouped by their original document

The dataset can be used to develop and evaluate document-level MIAs against LLMs trained on The Pile. Target models include the Pythia and GPTNeo model suites, available [here](https://huggingface.co/EleutherAI). Our understanding is that the deduplication performed on the Pile to create the "Pythia-dedup" models was applied only to the training dataset, which suggests that this set of members and non-members is also valid for these models.

For more information, we refer to [the paper](https://arxiv.org/pdf/2406.17975).
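
The sequence construction and the role of `doc_idx` described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `first_sequences` and `group_by_document` are hypothetical, and the sketch assumes that words are split on any whitespace, rejoined with single spaces, and that a trailing partial chunk is dropped; the card does not state these details.

```python
from collections import defaultdict


def first_sequences(document, n_sequences=25, words_per_sequence=200):
    """Return up to n_sequences chunks of words_per_sequence words each.

    A word is any run of non-whitespace characters, matching the card's
    definition. Rejoining with single spaces is an assumption.
    """
    words = document.split()
    chunks = []
    for i in range(n_sequences):
        chunk = words[i * words_per_sequence:(i + 1) * words_per_sequence]
        if len(chunk) < words_per_sequence:
            break  # assumption: a trailing partial chunk is dropped
        chunks.append(" ".join(chunk))
    return chunks


def group_by_document(rows):
    """Group rows (dicts with text, label, doc_idx) by their doc_idx.

    Returns {doc_idx: {"label": int, "texts": [str, ...]}} and raises
    ValueError if sequences of one document carry different labels.
    """
    grouped = defaultdict(lambda: {"label": None, "texts": []})
    for row in rows:
        entry = grouped[row["doc_idx"]]
        if entry["label"] is not None and entry["label"] != row["label"]:
            raise ValueError(f"inconsistent labels for {row['doc_idx']}")
        entry["label"] = row["label"]
        entry["texts"].append(row["text"])
    return dict(grouped)
```

Any iterable of dicts with the keys `text`, `label`, and `doc_idx` can be passed to `group_by_document`, so rows loaded with the Hugging Face `datasets` library should work directly. A document-level attack would then score each group as a whole rather than each sequence in isolation.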

Provider:

imperial-cpg
