imperial-cpg/copyright-traps

Hugging Face | Updated 2024-10-08 | Indexed 2025-04-26
Download link:
https://hf-mirror.com/datasets/imperial-cpg/copyright-traps
Resource description:
---
dataset_info:
  features:
  - name: perplexity_bucket
    dtype: int64
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: seq_len_25_n_rep_10
    num_bytes: 107987
    num_examples: 1000
  - name: seq_len_25_n_rep_100
    num_bytes: 108110
    num_examples: 1000
  - name: seq_len_25_n_rep_1000
    num_bytes: 108285
    num_examples: 1000
  - name: seq_len_50_n_rep_10
    num_bytes: 198293
    num_examples: 1000
  - name: seq_len_50_n_rep_100
    num_bytes: 198133
    num_examples: 1000
  - name: seq_len_50_n_rep_1000
    num_bytes: 198868
    num_examples: 1000
  - name: seq_len_100_n_rep_10
    num_bytes: 385926
    num_examples: 1000
  - name: seq_len_100_n_rep_100
    num_bytes: 386468
    num_examples: 1000
  - name: seq_len_100_n_rep_1000
    num_bytes: 387679
    num_examples: 1000
  download_size: 1494187
  dataset_size: 2079749
configs:
- config_name: default
  data_files:
  - split: seq_len_25_n_rep_10
    path: data/seq_len_25_n_rep_10-*
  - split: seq_len_25_n_rep_100
    path: data/seq_len_25_n_rep_100-*
  - split: seq_len_25_n_rep_1000
    path: data/seq_len_25_n_rep_1000-*
  - split: seq_len_50_n_rep_10
    path: data/seq_len_50_n_rep_10-*
  - split: seq_len_50_n_rep_100
    path: data/seq_len_50_n_rep_100-*
  - split: seq_len_50_n_rep_1000
    path: data/seq_len_50_n_rep_1000-*
  - split: seq_len_100_n_rep_10
    path: data/seq_len_100_n_rep_10-*
  - split: seq_len_100_n_rep_100
    path: data/seq_len_100_n_rep_100-*
  - split: seq_len_100_n_rep_1000
    path: data/seq_len_100_n_rep_1000-*
---

# Copyright Traps

Copyright traps (see [Meeus et al. (ICML 2024)](https://arxiv.org/pdf/2402.09363)) are unique, synthetically generated sequences that have been included in the training dataset of [CroissantLLM](https://huggingface.co/croissantllm/CroissantLLMBase). This dataset allows for the evaluation of Membership Inference Attacks (MIAs) using CroissantLLM as the target model, where the goal is to infer whether a given trap sequence was included in or excluded from the training data.
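
As a rough illustration of how an MIA run against these traps might be scored, the sketch below computes ROC AUC from per-sequence attack scores and the `label` column. It assumes that a higher score means "more likely a member". The function name `roc_auc` and the toy scores are hypothetical, and AUC is only one possible metric; the card itself does not prescribe one.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC via pairwise comparison (Mann-Whitney U formulation).

    Returns the fraction of (member, non-member) pairs in which the member
    receives the higher score; ties count as half a win.
    Assumes label 1 = member and label 0 = non-member, as in the dataset.
    """
    members = [s for s, y in zip(scores, labels) if y == 1]
    non_members = [s for s, y in zip(scores, labels) if y == 0]
    if not members or not non_members:
        raise ValueError("need at least one member and one non-member")

    wins = 0.0
    for m in members:
        for n in non_members:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(members) * len(non_members))


# Toy values standing in for a real attack's output (hypothetical numbers).
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)  # 3 of the 4 member/non-member pairs are ordered correctly
```

An AUC of 0.5 corresponds to random guessing, and 1.0 to an attack that ranks every member above every non-member. The quadratic loop is fine for splits of 1000 examples; a sorting-based implementation would be preferable for much larger sets.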
This dataset contains non-member (`label=0`) and member (`label=1`) trap sequences, which were generated using [this code](https://github.com/computationalprivacy/copyright-traps) by sampling text from [LLaMA-2 7B](https://huggingface.co/meta-llama/Llama-2-7b) while controlling for sequence length and perplexity. The dataset is divided into splits named `seq_len_{XX}_n_rep_{YY}`, where `XX={25,50,100}` is the sequence length in tokens and `YY={10, 100, 1000}` is the number of repetitions of each member sequence. Each split also records the perplexity bucket (`perplexity_bucket`) of each trap sequence; the original paper showed that higher-perplexity sequences tend to be more vulnerable.

Note that for a fixed sequence length, every split contains the same set of non-member sequences (`n_rep=0`), regardless of the number of repetitions.

Additional non-members generated in exactly the same way are provided [here](https://huggingface.co/datasets/imperial-cpg/copyright-traps-extra-non-members); these may be required by MIA methodologies that make additional assumptions about the attacker.

If this dataset was useful for your work, kindly cite:

```
@inproceedings{meeuscopyright,
  title={Copyright Traps for Large Language Models},
  author={Meeus, Matthieu and Shilov, Igor and Faysse, Manuel and de Montjoye, Yves-Alexandre},
  booktitle={Forty-first International Conference on Machine Learning}
}
```
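
The `seq_len_{XX}_n_rep_{YY}` naming scheme described above can be sketched as a small helper that builds valid split names and loads one split. This is a minimal sketch, assuming the Hugging Face `datasets` library is installed. The helper names `split_name`, `load_traps`, and `split_by_label` are hypothetical and not part of the dataset or its code repository.

```python
from typing import Dict, Iterable, List, Tuple

# Values listed in the dataset card.
SEQ_LENS = (25, 50, 100)
N_REPS = (10, 100, 1000)
REPO_ID = "imperial-cpg/copyright-traps"


def split_name(seq_len: int, n_rep: int) -> str:
    """Build a split name such as 'seq_len_100_n_rep_1000' (hypothetical helper)."""
    if seq_len not in SEQ_LENS:
        raise ValueError(f"seq_len must be one of {SEQ_LENS}, got {seq_len}")
    if n_rep not in N_REPS:
        raise ValueError(f"n_rep must be one of {N_REPS}, got {n_rep}")
    return f"seq_len_{seq_len}_n_rep_{n_rep}"


def load_traps(seq_len: int, n_rep: int):
    """Download one split from the Hub (needs network access and `datasets`)."""
    from datasets import load_dataset  # imported lazily so the helpers above work without it

    return load_dataset(REPO_ID, split=split_name(seq_len, n_rep))


def split_by_label(rows: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Separate rows into (members, non_members) using the `label` column."""
    members, non_members = [], []
    for row in rows:
        (members if row["label"] == 1 else non_members).append(row)
    return members, non_members


# All nine split names declared in the card's metadata.
all_splits = [split_name(s, r) for s in SEQ_LENS for r in N_REPS]
```

With network access, `members, non_members = split_by_label(load_traps(100, 1000))` would give the two groups for the longest, most-repeated traps. Each row carries the `text`, `label`, and `perplexity_bucket` fields.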
Provided by:
imperial-cpg