imperial-cpg/pile_arxiv_doc_mia_sequences
Source: Hugging Face. Updated 2024-10-07. Indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/imperial-cpg/pile_arxiv_doc_mia_sequences
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  - name: doc_idx
    dtype: string
  splits:
  - name: train
    num_bytes: 76605258
    num_examples: 50000
  download_size: 36641532
  dataset_size: 76605258
---
# ArXiv papers from The Pile for document-level MIAs against LLMs (split into sequences)
This dataset contains **sequences from** ArXiv papers randomly sampled from the train (members) and test (non-members) splits of (the uncopyrighted version of) [the Pile](https://huggingface.co/datasets/monology/pile-uncopyrighted).
We randomly sample 1,000 member documents and 1,000 non-member documents, ensuring that each selected document has at least 5,000 words (a word being any sequence of characters separated by whitespace).
This dataset contains the first 25 sequences of 200 words from each of these documents (2,000 documents × 25 sequences = 50,000 sequences in total). The documents are available in full [here](https://huggingface.co/datasets/imperial-cpg/pile_arxiv_doc_mia).
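The splitting described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `split_into_sequences`, the use of `str.split()` to define words, re-joining words with a single space, and dropping a trailing partial chunk are all assumptions.

```python
def split_into_sequences(text, words_per_sequence=200, max_sequences=25):
    """Return up to `max_sequences` consecutive chunks of `words_per_sequence` words.

    Assumption: a "word" is any run of non-whitespace characters, and the
    words of each chunk are re-joined with a single space.
    """
    words = text.split()
    sequences = []
    for start in range(0, len(words), words_per_sequence):
        if len(sequences) == max_sequences:
            break
        chunk = words[start:start + words_per_sequence]
        if len(chunk) < words_per_sequence:
            break  # assumption: a trailing partial chunk is dropped
        sequences.append(" ".join(chunk))
    return sequences


# Synthetic 5,000-word document (the minimum document length stated above).
toy_document = " ".join(f"w{i}" for i in range(5000))
toy_sequences = split_into_sequences(toy_document)
```

Because every selected document has at least 5,000 words, the first 25 sequences of 200 words always exist, so the partial-chunk rule never applies to this dataset.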
The dataset contains the following columns:
- text: the raw text of the sequence
- label: binary label for membership (1=member)
- doc_idx: index used to group sequences that belong to the same original document
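As an illustration of how the `doc_idx` column can be used, the sketch below regroups sequence rows into documents. The helper names `group_by_document` and `load_rows` are hypothetical. `load_rows` assumes the Hugging Face `datasets` library is installed and that the data lives in the `train` split listed in the metadata. The toy rows are synthetic.

```python
def group_by_document(rows):
    """Group rows into {doc_idx: {"label": int, "texts": [str, ...]}}.

    `rows` is any iterable of dicts with the keys text, label and doc_idx.
    """
    docs = {}
    for row in rows:
        doc = docs.setdefault(row["doc_idx"], {"label": row["label"], "texts": []})
        if doc["label"] != row["label"]:
            raise ValueError(f"inconsistent labels for document {row['doc_idx']}")
        doc["texts"].append(row["text"])
    return docs


def load_rows():
    """Load the real data (needs network access and the `datasets` package)."""
    from datasets import load_dataset
    return load_dataset("imperial-cpg/pile_arxiv_doc_mia_sequences", split="train")


# Synthetic rows that mimic the three columns of the dataset.
toy_rows = [
    {"text": "first chunk", "label": 1, "doc_idx": "a"},
    {"text": "second chunk", "label": 1, "doc_idx": "a"},
    {"text": "other paper", "label": 0, "doc_idx": "b"},
]
toy_docs = group_by_document(toy_rows)
```

Calling `group_by_document(load_rows())` on the real data should, given the counts stated above, yield 2,000 documents with 25 sequences each.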
The dataset can be used to develop and evaluate document-level MIAs against LLMs trained on The Pile.
Target models include the Pythia and GPT-Neo model suites, available [here](https://huggingface.co/EleutherAI). Our understanding is that the deduplication applied to the Pile to create the "Pythia-dedup" models was performed only on the training dataset, which suggests that this dataset of members and non-members is also valid for these models.
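A document-level evaluation could be sketched as below. This is not the attack from the paper. It assumes that some per-sequence membership score is already available (for example, one derived from a target model's loss on each sequence, which is not computed here), that sequence scores are aggregated per document by a simple mean, and that performance is reported as ROC AUC. The names `aggregate_scores` and `auc` are hypothetical, and the toy scores are synthetic.

```python
from collections import defaultdict


def aggregate_scores(doc_ids, seq_scores):
    """Average per-sequence scores into one score per document (an assumption)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for doc_id, score in zip(doc_ids, seq_scores):
        sums[doc_id] += score
        counts[doc_id] += 1
    return {doc_id: sums[doc_id] / counts[doc_id] for doc_id in sums}


def auc(labels, scores):
    """ROC AUC: probability that a member outscores a non-member (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic example: two documents with two scored sequences each.
toy_doc_ids = ["m1", "m1", "n1", "n1"]
toy_seq_scores = [1.0, 3.0, 0.0, 1.0]  # higher = more likely a member
toy_doc_labels = {"m1": 1, "n1": 0}

toy_doc_scores = aggregate_scores(toy_doc_ids, toy_seq_scores)
toy_auc = auc(
    [toy_doc_labels[d] for d in toy_doc_scores],
    [toy_doc_scores[d] for d in toy_doc_scores],
)
```

The pairwise AUC is quadratic in the number of documents, which is manageable at this dataset's scale of 1,000 members and 1,000 non-members.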
For more information, see [the paper](https://arxiv.org/pdf/2406.17975).
Provided by:
imperial-cpg



