imperial-cpg/pile_arxiv_doc_mia

Name: imperial-cpg/pile_arxiv_doc_mia
Creator: imperial-cpg
Published: 2024-10-07 18:31:22
License: no description available

Hugging Face · updated 2024-10-07 · indexed 2025-04-26

Download link:

https://hf-mirror.com/datasets/imperial-cpg/pile_arxiv_doc_mia

Description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 146613669
    num_examples: 2000
  download_size: 67134534
  dataset_size: 146613669
---

# ArXiv papers from The Pile for document-level MIAs against LLMs

This dataset contains **full** ArXiv papers randomly sampled from the train (members) and test (non-members) splits of (the uncopyrighted version of) [the Pile](https://huggingface.co/datasets/monology/pile-uncopyrighted). We randomly sample 1,000 member documents and 1,000 non-member documents, ensuring that each selected document has at least 5,000 words (a word being any sequence of characters separated by whitespace). We also provide a version of the dataset in which each document is split into 25 sequences of 200 words [here](https://huggingface.co/datasets/imperial-cpg/pile_arxiv_doc_mia_sequences).

The dataset contains the following columns:

- text: the raw text of the sequence
- label: binary label for membership (1=member)

The dataset can be used to develop and evaluate document-level membership inference attacks (MIAs) against LLMs trained on The Pile. Target models include the Pythia and GPTNeo suites, available [here](https://huggingface.co/EleutherAI). Our understanding is that the deduplication applied to the Pile to create the "Pythia-dedup" models was performed only on the training dataset, which suggests that this member/non-member dataset is also valid for those models.

For more information, we refer to [the paper](https://arxiv.org/pdf/2406.17975).
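The word-count rule and the 25 × 200 split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' preprocessing code: the helper names `count_words`, `split_into_sequences`, and `load_documents` are hypothetical, and the choice to keep the first 5,000 words of each document is a guess, since the card does not say which words the sequences are drawn from. Loading uses `datasets.load_dataset` and needs the `datasets` package and network access.

```python
from typing import List


def count_words(text: str) -> int:
    """Count words as whitespace-separated character sequences (the card's definition)."""
    return len(text.split())


def split_into_sequences(
    text: str, n_sequences: int = 25, words_per_sequence: int = 200
) -> List[str]:
    """Split a document into fixed-length word sequences.

    Assumption: the first n_sequences * words_per_sequence words are kept and
    the remainder is dropped; the dataset card does not specify this.
    Words are re-joined with single spaces, so original whitespace is not preserved.
    """
    words = text.split()
    needed = n_sequences * words_per_sequence
    if len(words) < needed:
        raise ValueError(f"document has {len(words)} words, need at least {needed}")
    return [
        " ".join(words[i:i + words_per_sequence])
        for i in range(0, needed, words_per_sequence)
    ]


def load_documents():
    """Load the train split from the Hugging Face Hub (requires `datasets` and network)."""
    # Imported lazily so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("imperial-cpg/pile_arxiv_doc_mia", split="train")


if __name__ == "__main__":
    # Synthetic stand-in for a paper, so the sketch runs offline.
    doc = " ".join(f"w{i}" for i in range(5200))
    sequences = split_into_sequences(doc)
    print(count_words(doc), len(sequences), count_words(sequences[0]))
```

With the real data, `load_documents()` should return rows exposing the `text` and `label` columns listed above, and each `text` can be passed to `split_into_sequences` directly because every document has at least 5,000 words.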

Provider:

imperial-cpg
