imperial-cpg/copyright-traps

Name: imperial-cpg/copyright-traps
Creator: imperial-cpg
Published: 2024-10-08 13:54:51
License: No description available

Source: Hugging Face. Updated 2024-10-08; indexed 2025-04-26.

Download link:

https://hf-mirror.com/datasets/imperial-cpg/copyright-traps

Resource description:

---
dataset_info:
  features:
  - name: perplexity_bucket
    dtype: int64
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: seq_len_25_n_rep_10
    num_bytes: 107987
    num_examples: 1000
  - name: seq_len_25_n_rep_100
    num_bytes: 108110
    num_examples: 1000
  - name: seq_len_25_n_rep_1000
    num_bytes: 108285
    num_examples: 1000
  - name: seq_len_50_n_rep_10
    num_bytes: 198293
    num_examples: 1000
  - name: seq_len_50_n_rep_100
    num_bytes: 198133
    num_examples: 1000
  - name: seq_len_50_n_rep_1000
    num_bytes: 198868
    num_examples: 1000
  - name: seq_len_100_n_rep_10
    num_bytes: 385926
    num_examples: 1000
  - name: seq_len_100_n_rep_100
    num_bytes: 386468
    num_examples: 1000
  - name: seq_len_100_n_rep_1000
    num_bytes: 387679
    num_examples: 1000
  download_size: 1494187
  dataset_size: 2079749
configs:
- config_name: default
  data_files:
  - split: seq_len_25_n_rep_10
    path: data/seq_len_25_n_rep_10-*
  - split: seq_len_25_n_rep_100
    path: data/seq_len_25_n_rep_100-*
  - split: seq_len_25_n_rep_1000
    path: data/seq_len_25_n_rep_1000-*
  - split: seq_len_50_n_rep_10
    path: data/seq_len_50_n_rep_10-*
  - split: seq_len_50_n_rep_100
    path: data/seq_len_50_n_rep_100-*
  - split: seq_len_50_n_rep_1000
    path: data/seq_len_50_n_rep_1000-*
  - split: seq_len_100_n_rep_10
    path: data/seq_len_100_n_rep_10-*
  - split: seq_len_100_n_rep_100
    path: data/seq_len_100_n_rep_100-*
  - split: seq_len_100_n_rep_1000
    path: data/seq_len_100_n_rep_1000-*
---

# Copyright Traps

Copyright traps (see [Meeus et al. (ICML 2024)](https://arxiv.org/pdf/2402.09363)) are unique, synthetically generated sequences that have been included in the training dataset of [CroissantLLM](https://huggingface.co/croissantllm/CroissantLLMBase). This dataset allows for the evaluation of Membership Inference Attacks (MIAs) using CroissantLLM as the target model, where the goal is to infer whether a given trap sequence was included in or excluded from the training data.
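
An MIA of this kind is commonly scored by how well its per-sequence membership score separates trained-on sequences from held-out ones. The sketch below computes ROC AUC by direct pairwise comparison, assuming members are marked `1` and non-members `0`. The function name `mia_auc` and the choice of score are illustrative assumptions, not the evaluation code of the original paper.

```python
from typing import Sequence


def mia_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC of a membership score.

    Equals the probability that a randomly chosen member (label 1) receives
    a higher score than a randomly chosen non-member (label 0); ties count
    as one half.
    """
    members = [s for s, y in zip(scores, labels) if y == 1]
    non_members = [s for s, y in zip(scores, labels) if y == 0]
    if not members or not non_members:
        raise ValueError("need at least one member and one non-member")

    wins = 0.0
    for m in members:
        for n in non_members:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(members) * len(non_members))
```

The pairwise loop is quadratic, which is acceptable for a split of 1,000 rows. Any score where higher means "more likely a member" can be passed in; an AUC of 0.5 corresponds to random guessing.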
This dataset contains non-member (`label=0`) and member (`label=1`) trap sequences, which have been generated using [this code](https://github.com/computationalprivacy/copyright-traps) and by sampling text from [LLaMA-2 7B](https://huggingface.co/meta-llama/Llama-2-7b) while controlling for sequence length and perplexity.

The dataset is divided into splits named `seq_len_{XX}_n_rep_{YY}`, where `XX={25,50,100}` is the sequence length in tokens and `YY={10, 100, 1000}` is the number of repetitions for member sequences. Each split also records the perplexity bucket (`perplexity_bucket`) of each trap sequence; the original paper showed that higher-perplexity sequences tend to be more vulnerable.

Note that for a fixed sequence length, each split contains the same set of non-member sequences (`n_rep=0`), regardless of the number of repetitions. Additional non-members generated in exactly the same way are provided [here](https://huggingface.co/datasets/imperial-cpg/copyright-traps-extra-non-members); these may be required by MIA methodologies that make additional assumptions about the attacker.

If this dataset was useful for your work, kindly cite:

```
@inproceedings{meeuscopyright,
  title={Copyright Traps for Large Language Models},
  author={Meeus, Matthieu and Shilov, Igor and Faysse, Manuel and de Montjoye, Yves-Alexandre},
  booktitle={Forty-first International Conference on Machine Learning}
}
```
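
The split naming scheme and label convention described above can be sketched in a few lines. This is a minimal example, assuming the Hugging Face `datasets` package is installed for the loading step; the helper names `split_name`, `load_traps`, and `split_by_label` are hypothetical and not part of the dataset or its accompanying code.

```python
from typing import Iterable, List, Mapping, Tuple

SEQ_LENS = (25, 50, 100)   # sequence lengths in tokens
N_REPS = (10, 100, 1000)   # repetitions of each member sequence


def split_name(seq_len: int, n_rep: int) -> str:
    """Build a split name of the form seq_len_{XX}_n_rep_{YY}."""
    if seq_len not in SEQ_LENS or n_rep not in N_REPS:
        raise ValueError(f"unsupported combination: {seq_len=}, {n_rep=}")
    return f"seq_len_{seq_len}_n_rep_{n_rep}"


def load_traps(seq_len: int, n_rep: int):
    """Download one split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the pure helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "imperial-cpg/copyright-traps", split=split_name(seq_len, n_rep)
    )


def split_by_label(rows: Iterable[Mapping]) -> Tuple[List[str], List[str]]:
    """Separate trap texts into (members, non_members) using the label column."""
    members: List[str] = []
    non_members: List[str] = []
    for row in rows:
        (members if row["label"] == 1 else non_members).append(row["text"])
    return members, non_members
```

For example, `members, non_members = split_by_label(load_traps(100, 1000))` would fetch the split with 100-token sequences repeated 1,000 times and separate it by label.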

Provided by:

imperial-cpg
