imperial-cpg/pile_arxiv_doc_mia
Source: Hugging Face. Updated 2024-10-07; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/imperial-cpg/pile_arxiv_doc_mia
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 146613669
    num_examples: 2000
  download_size: 67134534
  dataset_size: 146613669
---
# ArXiv papers from The Pile for document-level MIAs against LLMs
This dataset contains **full** ArXiv papers randomly sampled from the train (members) and test (non-members) splits of (the uncopyrighted version of) [the Pile](https://huggingface.co/datasets/monology/pile-uncopyrighted).
We randomly sample 1,000 member documents and 1,000 non-member documents, ensuring that each selected document has at least 5,000 words (a word being any sequence of characters separated by whitespace).
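The length criterion above can be sketched in a few lines. This is only an illustration, assuming that Python's `str.split()` (which splits on any run of whitespace) matches the word definition used here; the names `count_words` and `meets_min_length` are hypothetical and not taken from the authors' code.

```python
MIN_WORDS = 5000  # threshold stated in the dataset card


def count_words(text: str) -> int:
    """Count words as whitespace-separated character sequences."""
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


def meets_min_length(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the document has at least `min_words` words."""
    return count_words(text) >= min_words
```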
We also provide the dataset where each document is split into 25 sequences of 200 words [here](https://huggingface.co/datasets/imperial-cpg/pile_arxiv_doc_mia_sequences).
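A split of this kind might look like the sketch below. The card does not say which 5,000 words are kept, so this example assumes the first 25 × 200 words are used and the remainder is dropped; the function name `split_into_sequences` is hypothetical.

```python
from typing import List


def split_into_sequences(
    text: str, n_sequences: int = 25, words_per_sequence: int = 200
) -> List[str]:
    """Split a document into fixed-size word sequences.

    Assumption: the first n_sequences * words_per_sequence words are used
    and any remaining words are dropped.
    """
    words = text.split()
    needed = n_sequences * words_per_sequence
    if len(words) < needed:
        raise ValueError(f"document has {len(words)} words, need {needed}")
    # Re-joining with single spaces discards the original whitespace layout.
    return [
        " ".join(words[i : i + words_per_sequence])
        for i in range(0, needed, words_per_sequence)
    ]
```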
The dataset has the following columns:
- text: the raw text of the document
- label: binary label for membership (1=member, 0=non-member)
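As a rough sketch of how these two columns might be read, the snippet below loads the single `train` split with the Hugging Face `datasets` library and counts the labels. The helper names `load_documents` and `label_counts` are hypothetical; `load_documents` needs the `datasets` package and network access.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def label_counts(rows: Iterable[Mapping[str, object]]) -> Dict[int, int]:
    """Count how many rows carry each membership label."""
    return dict(Counter(int(row["label"]) for row in rows))


def load_documents():
    """Load the single `train` split from the Hugging Face Hub.

    Requires the `datasets` package and network access.
    """
    # Imported lazily so label_counts() can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset("imperial-cpg/pile_arxiv_doc_mia", split="train")
```

Calling `label_counts(load_documents())` should then report the number of documents per label, which according to the card is 1,000 for each.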
The dataset can be used to develop and evaluate document-level membership inference attacks (MIAs) against LLMs trained on The Pile.
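One common way to evaluate such an attack is the ROC AUC of its per-document scores against the `label` column. The card does not prescribe a metric, so the choice of AUC here is an assumption, as is the convention that a higher score means "more likely a member"; `roc_auc` is a hypothetical helper written in plain Python.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC: probability that a random member outscores a random non-member.

    Ties count as half. Higher scores are assumed to mean "more likely a member".
    """
    members = [s for y, s in zip(labels, scores) if y == 1]
    non_members = [s for y, s in zip(labels, scores) if y == 0]
    if not members or not non_members:
        raise ValueError("need at least one member and one non-member")
    wins = 0.0
    # Pairwise comparison is quadratic, which is fine for 1,000 x 1,000 documents.
    for m in members:
        for n in non_members:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(members) * len(non_members))
```

An AUC of 0.5 corresponds to random guessing, and 1.0 to perfect separation of members from non-members.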
Target models include the Pythia and GPTNeo model suites, available [here](https://huggingface.co/EleutherAI). Our understanding is that the deduplication applied to the Pile to create the "Pythia-dedup" models was performed only on the training split, which suggests that this member/non-member dataset is also valid for those models.
For more information, see [the paper](https://arxiv.org/pdf/2406.17975).
Provider:
imperial-cpg



