
imperial-cpg/pile_arxiv_doc_mia

Source: Hugging Face. Updated: 2024-10-07. Indexed: 2025-04-26.
Download link:
https://hf-mirror.com/datasets/imperial-cpg/pile_arxiv_doc_mia
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 146613669
    num_examples: 2000
  download_size: 67134534
  dataset_size: 146613669
---

# ArXiv papers from The Pile for document-level MIAs against LLMs

This dataset contains **full** ArXiv papers randomly sampled from the train (members) and test (non-members) splits of (the uncopyrighted version of) [the Pile](https://huggingface.co/datasets/monology/pile-uncopyrighted). We randomly sample 1,000 member documents and 1,000 non-member documents, ensuring that each selected document has at least 5,000 words (a word being any sequence of characters separated by whitespace).

We also provide the dataset where each document is split into 25 sequences of 200 words [here](https://huggingface.co/datasets/imperial-cpg/pile_arxiv_doc_mia_sequences).

The dataset contains the following columns:
- text: the raw text of the sequence
- label: binary label for membership (1=member)

The dataset can be used to develop and evaluate document-level membership inference attacks (MIAs) against LLMs trained on The Pile. Target models include the suite of Pythia and GPTNeo models, to be found [here](https://huggingface.co/EleutherAI). Our understanding is that the deduplication performed on the Pile to create the "Pythia-dedup" models was applied only to the training dataset, suggesting that this dataset of members/non-members is also valid for these models.

For more information we refer to [the paper](https://arxiv.org/pdf/2406.17975).
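The word-count rule and the 25 x 200-word split described above can be sketched in a few lines of Python. This is only an illustration, not the authors' preprocessing code: the helper names `count_words`, `split_into_sequences`, and `load_documents` are hypothetical, and taking the first 5,000 words consecutively is an assumption, since the card does not say how the 25 sequences are selected. Loading relies on the `datasets` library's `load_dataset` and needs network access.

```python
def count_words(text: str) -> int:
    """Count words as whitespace-separated character sequences (the card's definition)."""
    return len(text.split())


def split_into_sequences(text: str, words_per_seq: int = 200, num_seqs: int = 25) -> list[str]:
    """Split a document into `num_seqs` consecutive chunks of `words_per_seq` words.

    Assumption: chunks are taken from the start of the document; the card
    does not state how the sequences are chosen.
    """
    words = text.split()
    needed = words_per_seq * num_seqs
    if len(words) < needed:
        raise ValueError(f"document has {len(words)} words, need at least {needed}")
    return [
        " ".join(words[i * words_per_seq:(i + 1) * words_per_seq])
        for i in range(num_seqs)
    ]


def load_documents():
    """Load the single `train` split (2,000 rows with `text` and `label` columns)."""
    # Imported lazily so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("imperial-cpg/pile_arxiv_doc_mia", split="train")


if __name__ == "__main__":
    ds = load_documents()
    example = ds[0]
    print("label:", example["label"], "words:", count_words(example["text"]))
    print("sequences:", len(split_into_sequences(example["text"])))
```

Because every document in the dataset is stated to have at least 5,000 words, `split_into_sequences` should not raise on any row with the default arguments.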

Provider:
imperial-cpg